Engineer / Lead Engineer, MLOps / SRE (Developer Experience), Xcloud

Singapore, Singapore

https://sg.mncjobz.com/company/home-team-science-and-technology-agency

Job Description

What the role is:
HTX is the first Science and Technology Agency of its kind in the world, bringing together science and engineering capabilities across the Home Team Departments to transform Singapore's homeland security landscape. We are a statutory board under the Ministry of Home Affairs, dedicated to developing cutting-edge technologies that empower our Home Team to solve crimes, save lives, secure borders, and safeguard public spaces.

As the MLOps/SRE Engineer for HTX's developer experience squad, you will be responsible for deploying, operating, and optimizing a production LLM system in our secure infrastructure. You will ensure the agentic code assistant is reliable, performant, and cost-effective, managing the full stack from LLM inference to vector databases, orchestration services, and observability. This role combines deep MLOps expertise with SRE discipline to support the Home Team's critical AI infrastructure.

What you will be working on:

o LLM Deployment: Deploy and manage LLMs using vLLM/TensorRT-LLM on our GPU infrastructure, optimizing for throughput, latency, and GPU utilization
o Infrastructure Management: Provision and maintain the supporting infrastructure, including vector databases (for RAG), orchestration services, Redis/queue systems, and API gateways
o Performance Optimization: Profile and tune LLM inference performance; experiment with batching strategies, context caching, and quantization techniques to maximize throughput within GPU constraints
o Observability: Implement comprehensive monitoring using Prometheus, Grafana, DCGM exporters, and Elastic Stack to track inference latency, token throughput, cache hit rates, and system health
o Reliability Engineering: Establish SLOs/SLIs, implement auto-scaling policies, design failure recovery mechanisms, and conduct chaos engineering to ensure high uptime
o Cost Optimization: Monitor GPU utilization and inference costs, identify optimization opportunities, and implement strategies to reduce token usage and compute spend
o Security & Compliance: Ensure all components operate within secure network boundaries, manage secrets and credentials securely, and maintain audit logs for compliance
o Incident Response: Participate in on-call rotation, troubleshoot production incidents, conduct root cause analysis, and implement preventive measures
o Capacity Planning: Model future load, forecast GPU requirements, and work with infrastructure teams to scale the platform as adoption grows

What we are looking for:

o 4+ years of experience in MLOps, SRE, or DevOps roles, with at least 1 year working with ML/AI systems
o Hands-on experience deploying and operating LLMs in production (vLLM, TGI, TensorRT-LLM, or similar)
o Strong Kubernetes expertise, including operators, StatefulSets, and GPU scheduling
o Deep understanding of GPU architecture and inference optimization techniques
o Experience with observability tools (Prometheus, Grafana, ELK/Elastic Stack)
o Solid Python and Bash scripting skills for automation
o Knowledge of vector databases (Milvus, Weaviate, Qdrant, or Pinecone)
o Experience with infrastructure-as-code (Terraform, Helm, Kustomize)
o Experience with NVIDIA GPUs (A100/H100/B200) and DCGM monitoring
o Understanding of LLM inference concepts: KV cache, continuous batching, PagedAttention
o Familiarity with Ray clusters, Kubeflow, or MLflow
o Background in SRE practices: SLO/SLI definition, error budgets, incident management
o Experience with secure or regulated environments
o Knowledge of LiteLLM, Kong Gateway, or API management platforms

Competencies:

o Systems thinking, with the ability to diagnose complex issues across the ML stack
o Data-driven decision making using metrics and telemetry
o Proactive mindset focused on reliability, automation, and preventive measures
o Strong debugging skills for GPU, networking, and distributed systems issues
o Clear incident communication and documentation
o Collaborative approach to working with data scientists, ML engineers, and platform teams

All new hires are appointed on a two-year contract in the first instance and will be assessed and considered for permanent tenure over time, based on performance.

As part of the shortlisting process for this role, you may be required to complete a medical declaration and/or undergo further assessment.

About Home Team Science and Technology Agency (HTX):
HTX is the world's first Science and Technology agency that integrates a diverse range of scientific and engineering capabilities to innovate and deliver transformative and operationally ready solutions for homeland security. As a statutory board of the Ministry of Home Affairs and integral to the Home Team, HTX works at the forefront of science and technology to empower Singapore's frontline of security. Our shared mission is to amplify, augment and accelerate the Home Team's advantage and secure Singapore as the safest place on planet Earth.