We are looking for a talented and motivated Senior Site Reliability Engineer (SRE) to join our team. This role requires a mix of infrastructure expertise, observability experience, and DevOps skills. As an SRE, you will be instrumental in building reliable, scalable, and efficient systems. The ideal candidate will have hands-on experience with Terraform, Ansible, and log analytics tools, combined with proficiency in Linux, Kubernetes, and AIOps platforms.
Key Responsibilities
Most critically, implement and maintain observability solutions using ELK and the Grafana suite (e.g., Loki, Tempo, Mimir, and Prometheus), ensuring complete monitoring, logging, and tracing capabilities.
Design, deploy, and manage scalable infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible, together with GitHub.
Leverage OpenTelemetry to instrument applications and collect telemetry data for performance insights and system health.
Automate configuration and operational tasks using Ansible to reduce manual effort.
Manage and monitor Kubernetes clusters and Linux-based systems to ensure optimal performance and availability.
Integrate and support SNMP-based Network Performance Monitoring (NPM) tools like SolarWinds, SevOne, or OpsRamp for network observability.
Implement event management systems and AIOps platforms for proactive incident detection, correlation, and automated resolution.
Collaborate with DevOps teams to build and maintain CI/CD pipelines for software delivery.
Perform incident management, conduct post-incident reviews, and drive long-term improvements through root-cause analysis.
Maintain detailed documentation for infrastructure, automation workflows, troubleshooting procedures, and operational best practices.
Qualifications
Required Expertise and Experience
At least 6 years of experience in SRE, DevOps, or a related engineering role.
Proficiency in Infrastructure as Code (IaC) using Terraform to manage complex infrastructure.
Hands-on experience with log analytics and observability tools, especially ELK (Elasticsearch, Logstash, Kibana) and the Grafana suite (Loki, Tempo, Mimir, Prometheus).
Knowledge of and experience with OpenTelemetry for distributed tracing and telemetry collection.
Experience working with Kubernetes clusters and Linux-based systems in production environments.
Expertise in automation using Ansible to streamline configuration and deployment processes.
Knowledge of SNMP-based NPM tools such as SolarWinds, SevOne, or OpsRamp for network monitoring.
Experience with AIOps platforms for event correlation and automated incident management.
Strong background in CI/CD practices, with hands-on involvement in building pipelines for software delivery.