Monitoring Lead Engineer

Singapore, Singapore

Job Description

Responsibilities

  • Design, implement, and maintain Datadog-based observability solutions across infrastructure, platforms, and applications.
  • Develop and optimize dashboards, monitors, and alerts to support proactive detection and triage of performance and reliability issues.
  • Integrate custom telemetry pipelines (metrics, logs, traces, events) aligned with OpenTelemetry and platform architecture standards.
  • Manage instrumentation strategies to ensure accurate and consistent coverage across services.
  • Apply SRE principles to improve service reliability, availability, and performance.
  • Define and track SLIs, SLOs, and SLAs for critical systems, and build feedback loops to continuously enhance service health.
  • Automate manual operational processes using Python, Terraform, or CI/CD tooling.
  • Collaborate with development and platform teams to identify resilience patterns and embed observability by design.
  • Serve as the subject matter expert (SME) for Datadog, advising on advanced configurations, integrations, and performance optimization.
  • Enable distributed tracing, APM, RUM, and synthetics capabilities to support end-to-end visibility.
  • Implement and maintain Datadog Terraform configurations, templates, and governance models for enterprise consistency.
  • Conduct performance tuning and cost optimization for Datadog usage across global environments.
  • Partner with the Operations and Platform teams to analyze incident patterns and provide root cause insights through observability data.
  • Lead post-incident reviews and recommend observability-driven improvements to prevent recurrence.
  • Build automation and correlation mechanisms for real-time alert enrichment and contextual diagnostics.

Requirements

  • Bachelor's degree in Computer Science, Information Systems, or a related field.
  • 5+ years of experience in observability engineering or SRE roles within large-scale distributed systems.
  • Deep, hands-on expertise with Datadog, including APM, Logs, Metrics, RUM, and Synthetics.
  • Strong proficiency in:
      • Infrastructure as Code (IaC): Terraform
      • Automation: Python, Bash, or similar scripting languages
      • CI/CD pipelines: Jenkins, GitLab, or GitHub Actions
  • Strong understanding of monitoring patterns, tracing, and event correlation for complex systems.
  • Familiarity with OpenTelemetry and modern observability frameworks.
  • Experience supporting multi-cloud environments (AWS, GCP, Azure).
  • Familiarity with container orchestration (Kubernetes, ECS) and service mesh observability.
  • Understanding of data visualization and analytics for operational reporting.
  • Exposure to AI-driven observability enhancements or integration with LLM-based insights (a plus).
  • Certification in Datadog, AWS, or GCP is advantageous.

Shortlisted candidates will be offered a 1-year agency contract.

Job Detail

  • Job Id
    JD1674562
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Singapore, Singapore
  • Education
    Not mentioned