Site Reliability Specialist

Singapore, Singapore

Job Description


Position Overview

The Reliability Lead will support the Reliability Principal in strategy discussions with senior management on application and system improvement, and will also manage the Reliability team.

He/She will ensure that the existing site reliability engineering (SRE) initiatives, such as monitoring availability, uplifting capability and automation, are on track. He/She will also assist the Reliability Principal and Engineering Teams in reviewing the reliability program to take stock of successes and challenges and refine the program. He/She will be in charge of the management reports that describe the current situation and recommend the next steps.

As Lead of the Reliability team, which consists of experienced engineers and product specialists, he/she will coach the engineering and service management teams to help them improve application reliability through tools, monitoring and prevention activities. He/She will collaborate with the applications, incident management (IOC) and infrastructure support teams to identify and implement procedures, tools and scripts that improve reliability, reduce downtime and increase automation.

Role & Responsibilities

  • Strive for automation either by coding it or by leading and influencing engineers to build systems that are easy to run in production
  • Identify significant projects that result in substantial cost savings
  • Identify changes for the production architecture from the reliability, performance and availability perspective with a data-driven approach
  • Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage, lowering operating costs for all our customers
  • Identify parts of the system that do not scale, provide immediate palliative measures and drive long-term resolution of the resulting incidents
  • Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives
  • Know a domain really well and radiate that knowledge through recorded demos, discussions in DNA (Design and Automation) meetings, or Incident Reviews
  • Run blameless root cause analyses (RCAs) on incidents and outages, aggressively looking for answers that will prevent the incident from happening again
  • Set an example for the team of SREs through positive and inclusive leadership and discussion of work
  • Show ownership of a major part of the infrastructure
  • De-escalate any conflicts inside the team

Requirements

  • Bachelor's degree in computer science or another highly technical or scientific discipline
  • Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C# or JavaScript
  • Experience with infrastructure technologies such as operating systems (Windows and Linux), networking, storage and virtualisation
  • Familiarity with test automation tools
  • Have a sense of urgency to deliver & iterate fast
  • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
  • Previous success in software engineering
  • Specialise in one or two of the following:
  • Strong software engineering, with the ability to code fixes for defects or vulnerabilities in our systems
  • Use infrastructure automation tools such as Chef or Ansible to efficiently manage our infrastructure
  • Implement "Infrastructure as Code" using Terraform and CI/CD for automation
  • Load balancing and high-availability application architecture, including proxies and CDNs, using F5
  • OpenShift and containerising our systems
  • Administer and manage high-availability, high-performance Microsoft SQL Server or Oracle clusters
  • Monitoring and Metrics in Dynatrace, ELK or eG and integrations with Dynatrace / ITSM
  • Logging infrastructure
  • Key, certificate and secret management
  • Backend storage management and scaling
  • Disaster Recovery and High Availability strategy

NOTE: It only takes a few minutes to apply for a meaningful career in HealthTech - GO FOR IT!!

#LI-IHIS11

M-2022-2160

IHiS

Job Detail

  • Job Id
    JD1351856
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Singapore, Singapore
  • Education
    Not mentioned