MLOps Engineer

Singapore, Singapore

Job Description

Job Title: MLOps Engineer (PyTorch)
Location: Singapore
Job Type: Full-time
About the Opportunity
Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. This role is focused on owning and evolving the on-premise infrastructure that powers their advanced PyTorch-based training workloads.
This position is a perfect fit for an engineer who is focused not just on model outcomes, but also on the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase is clean, scalable, and built to last.
Key Responsibilities

  • Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
  • Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
  • Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
  • Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking.
  • Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
  • Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
  • Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
  • Monitor and debug large-scale training jobs running across multiple nodes and GPUs.
Required Qualifications (You Should Have)
  • Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
  • Advanced programming skills in both C++ and Python.
  • A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
  • Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS tuning.
  • A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and the communication libraries and standards used in distributed training, such as NCCL and MPI.
  • A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
  • Experience with job schedulers (e.g., Slurm or custom runners) and cluster monitoring tools.
Preferred Qualifications (Nice-to-Have)
  • Experience with non-standard deployments, such as on-premise local clusters or edge devices (i.e., not public cloud).
  • Active contributions to PyTorch or other open-source ML/HPC tools.
  • Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
  • Experience building out a full logging, observability, and alerting stack for training workloads.
How to Apply
Interested candidates are invited to submit their resume, detailing their experience in managing PyTorch workloads on bare-metal infrastructure.

Job Detail

  • Job Id
    JD1696686
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Singapore, Singapore
  • Education
    Not mentioned