Design and implement monitoring solutions using application performance monitoring (APM) products.
Create and maintain monitoring dashboards to provide real-time visibility into system health and performance.
Collaborate with development and operations teams to define and implement alerting rules based on established best practices and specific system requirements.
Monitor system performance, availability, and capacity to proactively identify and address potential issues.
Continuously analyze monitoring data to identify opportunities for optimization and efficiency improvements.
Collaborate with cross-functional teams to ensure the reliability, scalability, and performance of our infrastructure.
Document monitoring and alerting configurations, processes, and best practices.
Requirements
Bachelor's degree in Computer Science or a related technical discipline, or equivalent practical experience.
Proven experience as a Site Reliability Engineer (SRE) or in a similar role with a focus on monitoring and alerting.
Proficiency with APM tools and technologies such as SolarWinds, IBM Instana, Prometheus, and Grafana.
Experience in creating and maintaining monitoring dashboards and writing alerting rules.
Understanding of cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes) is a plus.
Good communication and teamwork skills.
Shortlisted candidates will be offered a 1-year agency contract.