The Senior SRE will be responsible for the reliability, scalability, and performance of enterprise-grade Red Hat OpenShift Container Platform (OCP) clusters and of observability solutions for systems and networks deployed across hybrid cloud environments.
This role combines deep platform engineering, automation, performance optimization, observability, and security to ensure mission-critical workloads achieve high availability (99.99%+), compliance, and operational excellence.
The engineer will act as a technical authority for observability solutions, collaborating with cloud architects, security teams, DevOps engineers, and business stakeholders to design, implement, and optimize SRE practices and observability platforms that meet stringent enterprise SLAs.
Key Responsibilities
Platform Reliability & Performance
Own end-to-end reliability and scalability of multi-cluster OpenShift environments (on-premises and public cloud) supporting containerized enterprise workloads.
Manage multi-tenant observability solutions, including centralized log analytics, security event monitoring, and network and business observability.
Conduct performance engineering for both OCP clusters and ELK (Elasticsearch, Logstash, Kibana) components.
Lead capacity planning.
Automation & Resilience Engineering
Implement Infrastructure-as-Code and GitOps pipelines using Terraform, Ansible, and Jenkins to deliver reproducible OCP and ELK deployments.
Build self-healing and auto-remediation workflows leveraging Kubernetes operators, ServiceNow ITOM/AIOps integrations, and custom runbooks.
Design and enforce automated backup/restore, failover, and disaster recovery strategies across multiple data centers and cloud regions.
Develop SLO/SLI dashboards for performance, latency, error budgets, and saturation metrics using Prometheus, Grafana, and Kibana.
Observability & Incident Response
Drive adoption of ALErTS metrics to align reliability KPIs with business outcomes.
Standardize log ingestion pipelines with Logstash/Beats/Fluent Bit across heterogeneous infrastructure.
Lead root cause analysis (RCA) for complex production issues and establish blameless post-incident retrospectives with actionable follow-ups.