Site Reliability Engineer

SG, Singapore

Job Description

Job Summary:

We are seeking a Senior Site Reliability Engineer (SRE) with 10-15 years of proven experience in building, managing, and maintaining highly available, scalable, and secure infrastructure across multi-cloud and hybrid cloud environments, including on-premises data centers.


The ideal candidate will have deep knowledge of SRE principles, strong hands-on experience in automation, observability, incident response, and infrastructure resilience, and the ability to architect solutions that span cloud and traditional data center environments.


Key Responsibilities:

  • Design, implement, and manage reliable and scalable systems across public clouds (AWS, Azure, GCP) and on-premises data centers.
  • Apply SRE best practices, including SLIs, SLOs, error budgets, incident management, and postmortems, across cloud and non-cloud environments.
  • Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or CloudFormation.
  • Drive automation for deployment, scaling, monitoring, and infrastructure management.
  • Implement and enhance observability practices (monitoring, logging, tracing) using tools like Prometheus, Grafana, ELK, Datadog, and New Relic.
  • Work with application teams to ensure high availability, performance, and cost optimization across hybrid environments.
  • Lead and participate in on-call rotations and improve overall incident response processes.
  • Collaborate with security and compliance teams to enforce best practices in data protection, access control, and system hardening in hybrid setups.
  • Evaluate and recommend emerging tools and technologies for resilience engineering, disaster recovery, and infrastructure modernization.

Required Qualifications:

  • 10-15 years of experience in SRE, DevOps, or infrastructure engineering roles.
  • Proven experience managing infrastructure in multi-cloud (AWS, Azure, GCP) and hybrid cloud/on-prem environments.
  • Solid understanding of networking, load balancing, storage, virtualization, and container orchestration (Kubernetes, Docker).
  • Strong scripting and programming skills (e.g., Python, Go, Bash).
  • Experience with CI/CD pipelines and tools such as Jenkins, GitLab CI, and ArgoCD.
  • In-depth knowledge of SRE methodologies and real-world application of SLAs, SLOs, and error budgets.
  • Hands-on experience with monitoring and observability stacks.
  • Strong analytical and troubleshooting skills for production incidents across complex, distributed systems.



Job Detail

  • Job ID: JD1635998
  • Industry: Not mentioned
  • Total Positions: 1
  • Job Type: Full Time
  • Salary: Not mentioned
  • Employment Status: Permanent
  • Job Location: SG, Singapore
  • Education: Not mentioned