Site Reliability Engineer Job at StarHub

Site Reliability Engineer

SG, Singapore

Job Description

We are looking for a talented and motivated Senior Site Reliability Engineer (SRE) to join our team. This role requires a mix of infrastructure expertise, hands-on observability experience, and DevOps skills. As an SRE, you will be instrumental in building reliable, scalable, and efficient systems. The ideal candidate will have hands-on experience with Terraform, Ansible, and log analytics tools, along with proficiency in Linux, Kubernetes, and AIOps platforms.

Key Responsibilities

Most critically, implement and maintain observability solutions using ELK and the Grafana suite (e.g. Loki, Tempo, Mimir, and Prometheus), ensuring complete monitoring, logging, and tracing capabilities.

Design, deploy, and manage scalable infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible, with GitHub for version control.

Leverage OpenTelemetry to instrument applications and collect telemetry data for performance insights and system health.

Automate configuration and operational tasks using Ansible to reduce manual effort.

Manage and monitor Kubernetes clusters and Linux-based systems to ensure optimal performance and availability.

Integrate and support SNMP-based Network Performance Monitoring (NPM) tools like SolarWinds, SevOne, or OpsRamp for network observability.

Implement event management systems and AIOps platforms for proactive incident detection, correlation, and automated resolution.

Collaborate with DevOps teams to build and maintain CI/CD pipelines.

Perform incident management, conduct post-incident reviews, and drive long-term improvements through root-cause analysis.

Maintain detailed documentation for infrastructure, automation workflows, troubleshooting procedures, and operational best practices.

Qualifications

Required Expertise and Experience

At least 6 years of experience in SRE, DevOps, or a related engineering role.

Proficiency in Infrastructure as Code (IaC) using Terraform to manage complex infrastructure.

Hands-on experience with log analytics and observability tools, especially ELK (Elasticsearch, Logstash, Kibana) and the Grafana suite (Loki, Tempo, Mimir, Prometheus).

Knowledge of and experience with OpenTelemetry for distributed tracing and telemetry collection.

Experience working with Kubernetes clusters and Linux-based systems in production environments.

Expertise in automation using Ansible to streamline configuration and deployment processes.

Knowledge of SNMP-based NPM tools such as SolarWinds, SevOne, or OpsRamp for network monitoring.

Experience with AIOps platforms for event correlation and automated incident management.

Strong background in CI/CD practices, with hands-on involvement in building pipelines for software delivery.

Required Skills and Qualifications

Technical Skills:

Observability and data infrastructure architecture.

Implementing observability with ELK, Grafana suite, and OpenTelemetry.

Automation using Ansible.

Kubernetes orchestration and Linux system administration.

Expertise in SNMP-based NPM tools (SolarWinds, SevOne, or OpsRamp).

Experience with AIOps and event management platforms.

Soft Skills:

Strong problem-solving abilities with a focus on automation and continuous improvement.

Excellent communication and collaboration skills across cross-functional teams.

Ability to thrive in a dynamic, fast-paced environment and manage multiple priorities.

Preferred Knowledge:

Familiarity with GitOps practices for infrastructure management.

Understanding of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Security awareness and experience implementing secure infrastructure.