Site Reliability Engineer

SG, Singapore

Job Description

Date: 17 Dec 2025

Location: SG

Company: StarHub Ltd



Job Purpose

The Senior SRE will be responsible for the reliability, scalability, and performance of enterprise-grade Red Hat OpenShift Container Platforms (OCP) and of observability solutions for systems and networks deployed across hybrid cloud environments.

This role combines deep platform engineering, automation, performance optimization, observability, and security to ensure mission-critical workloads achieve high availability (99.99%+), compliance, and operational excellence.



The engineer will act as a technical authority for observability solutions, collaborating with cloud architects, security teams, DevOps engineers, and business stakeholders to design, implement, and optimize SRE practices and observability platforms that meet stringent enterprise SLAs.

Key Responsibilities

Platform Reliability & Performance



Own end-to-end reliability and scalability of multi-cluster OpenShift environments (on-premises and public cloud) supporting containerized enterprise workloads.



Manage multi-tenant observability solutions including centralized log analytics, security event monitoring, network and business observability.



Conduct performance engineering for both OCP clusters and ELK components.



Lead capacity planning.



Automation & Resilience Engineering



Implement Infrastructure-as-Code and GitOps pipelines using Terraform, Ansible, and Jenkins for reproducible OCP and ELK deployments.



Build self-healing and auto-remediation workflows leveraging Kubernetes operators, ServiceNow ITOM/AIOps integrations, and custom runbooks.



Design and enforce automated backup/restore, failover, and disaster recovery strategies across multiple data centers and cloud regions.



Develop SLO/SLI dashboards for performance, latency, error budgets, and saturation metrics using Prometheus, Grafana, and Kibana.



Observability & Incident Response



Drive adoption of ALErTS metrics to align reliability KPIs with business outcomes.



Standardize log ingestion pipelines with Logstash/Beats/Fluent Bit across heterogeneous infrastructure.



Lead root cause analysis (RCA) for complex production issues and establish blameless post-incident retrospectives with actionable follow-ups.



Security, Compliance & Governance



Ensure secure cluster configurations (CIS-compliant hardening, RBAC/ABAC policies, secrets management with Vault/KMS).



Enforce data privacy and retention policies for logs and traces, and support audit readiness.



Partner with InfoSec to perform vulnerability scanning, patch automation, and zero-trust network policy enforcement.



Innovation & Continuous Improvement



Champion SRE best practices and cost-optimization efforts.



Mentor junior engineers and contribute to knowledge-sharing playbooks and runbooks for SRE and observability.



Evaluate and adopt emerging cloud-native observability tools (e.g., OpenTelemetry, Elastic Agent, Loki, Tempo) to modernize telemetry pipelines.

Qualifications

Education



Bachelor's degree (or higher) in Computer Science, Information Systems, or a related engineering discipline.



Experience



5 years in enterprise infrastructure/DevOps/SRE roles, with at least 2 years managing Red Hat Linux/Kubernetes/OpenShift clusters in production.



Proven expertise in designing and operating large-scale ELK clusters (>10 TB/day log ingestion) for mission-critical workloads.



Strong experience with Linux systems administration (RHEL/Ubuntu), networking (overlay networks, CNI plugins), container runtime security, and storage backends (Ceph, NFS, SAN).



Track record of leading incident response and RCA for high-severity (P1/P2) production outages.



Demonstrated experience in performance optimization and cost-efficient scaling of container and observability platforms.



Technical Skills



Container & Cloud Platforms:

Kubernetes, Red Hat OCP, Docker, Podman, Helm, Operators.



Observability Stack:

ELK, Beats/Fluent Bit, OpenTelemetry, Prometheus, Grafana.



Automation & CI/CD:

Terraform, Ansible, Jenkins, GitHub Actions, GitOps.



Programming/Scripting:

Python, Go, or Bash for automation and custom operators.



Security & Compliance:

Vault/KMS, CIS Benchmarks, RBAC/OPA, TLS/mTLS.



Soft Skills



Analytical mindset with strong troubleshooting and problem-solving abilities.



Excellent communication skills for cross-functional collaboration (Cloud, Security, DevOps, Application teams).



Leadership qualities for mentoring, incident command, and driving technical decisions.



Preferred Certifications



Any Red Hat Certification in Linux and OpenShift Administration or Kubernetes CKA/CKS/CKAD.



Elastic Certified Engineer or equivalent ELK-related certification.



AWS/GCP/Azure Certified SysOps/Architect (for hybrid-cloud OCP deployments).



HashiCorp Terraform Associate, Ansible Automation, or equivalent.



ITIL v4 Foundation and/or SRE Foundation/Practitioner certifications.




Job Detail

  • Job Id
    JD1705474
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Full Time
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    SG, Singapore
  • Education
    Not mentioned