VP, Problem & Knowledge Management Lead, SRE & Governance, Group Technology

SG, Singapore

Job Description

The Role:



This position is for an SRE Problem and Knowledge Management Team Lead within the enabling group, the Site Reliability Engineering and Governance (SRE & Governance) department.

This role is expected to strategically lead incident retrospective and problem management operations, and to contribute to other SRE activities related to maintenance management, including availability, performance, change management, monitoring, capacity planning and the solutions derived from emergency response.

The Team Lead ensures that retrospective activities are orchestrated and carried out effectively while promoting a blameless culture in accordance with SRE principles.



Responsibilities:



  • Mentor the team in the seamless facilitation & conduct of root cause analysis (RCA) activities from end to end
  • Lead the facilitation for high-severity incidents, liaising with top/senior management and keeping them updated
  • Act as the prime focal point for presenting in the RCA Forum, Tech Risk Forum and other senior management meetings to report updates on retrospective findings & action plans
  • Absorb new technology rapidly & apply it effectively
  • Communicate well with technical & non-technical colleagues
  • Work to a high standard within agreed timescales
  • Undertake any other reasonable tasks or duties requested by the supervisor or a member of the senior management team
  • Manage resources to ensure problem management activities are carried out effectively and efficiently
  • Provide platforms and channels to keep stakeholders updated on the results of retrospectives and RCA activities
  • Demonstrate authority in problem management calls
  • Act as point of contact for assigned incidents of higher severity (from incident retrospective calls all the way up to Management Report (MR) documentation and publishing)
  • Take accountability for SRE enhancement initiatives that result from retrospectives
  • Collaborate with Engineering Teams within SRE and with LOBs on enabling activities as part of preventive measures

Requirements:



  • Minimum 15 years of process improvement/root cause analysis (RCA) exposure & involvement leading discussions as a problem manager or incident commander, preferably in the Technology & Operations space
  • Experience with JIRA, Confluence, Jenkins, Nexus, SonarQube, Bitbucket, S3 and cloud computing
  • Good exposure to logging & monitoring tools like Dynatrace, Prometheus, Grafana, ELG/ELK
  • In-depth understanding of Incident & Problem Management functions & activities (i.e. hardware- & software-related incident & problem management)
  • Work with stakeholders & command centre in troubleshooting, escalating & solutioning critical site incidents
  • Identify recurring system/application issues & work with cloud team, infra teams, product development, vendors & other stakeholders in investigating & resolving the cause
  • Maintain accurate documentation of incidents, including impact details, timelines and steps taken for mitigation/resolution
  • Strong verbal & written communication skills, particularly effective documentation skills
  • Minimum 10 years of software development, technical support or operations experience
  • Basic knowledge of Linux, AIX, Solaris and Windows
  • Exposure to enterprise databases, e.g. Oracle, SQL Server, MariaDB, MongoDB & Sybase
  • Knowledge of systems, multi-tier application & network troubleshooting
  • Essential knowledge & awareness of Public/Private/Hybrid cloud solutions


Job Detail

  • Job Id
    JD1531483
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Full Time
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    SG, Singapore
  • Education
    Not mentioned