Location
Singapore
Full time
Non-Remote
No visa sponsorship
Who are we looking for
--------------------------
We seek an experienced Machine Learning Operations Engineer (Senior) to transform cutting-edge research into robust, production-ready services for synthetic data generation and to optimize both deep learning and classical ML algorithms (e.g. tree-based models) at enterprise scale (billions of rows). You will build and tune model pipelines end-to-end, ensuring high performance, scalability, and reliability across diverse workloads and dataset sizes.
Key Responsibilities:
-------------------------
Algorithm Optimization & Scaling
Identify and remove bottlenecks in deep generative models (e.g. transformers, diffusion models, GANs) to accelerate both training and generation.
Implement distributed training of these models across multi-GPU clusters.
Optimize distributed training of traditional ML models (e.g. XGBoost, LightGBM, CatBoost) on billion-row datasets.
Establish best practices for memory management that maximize compute and memory utilization, enabling faster training at lower cost.
Data Handling at Scale
Collaborate with data engineers to design ETL/ELT workflows that handle terabyte- to petabyte-scale tabular and unstructured data.
Implement scalable feature engineering pipelines using distributed computing frameworks (e.g. Spark, Dask, or Ray).
Automate data validation (e.g. schema checks, anomaly detection) with rule-based and ML-driven frameworks.
End-to-End Orchestration
Build ML pipelines that transition research prototypes into reliable, production-grade workflows.
Package models into Docker containers and deploy using Kubernetes.
Build automated model and data quality monitoring and validation systems to ensure data integrity throughout the pipeline lifecycle.
Design robust error handling mechanisms, with automatic retries and data recovery in case of pipeline failures.
Implement logging, monitoring, and alerting systems.
Qualifications
------------------
Bachelor's or Master's degree in Computer Science, Electrical Engineering, Software Engineering, Data Science or a related quantitative discipline.
5+ years of hands-on experience optimizing and scaling machine learning models in production environments.
Demonstrated track record of accelerating model training workflows (e.g., transformers, diffusion models, GANs) at multi-GPU scale.
Experience operating ETL/ELT pipelines that handle terabytes to petabytes of tabular and unstructured data using distributed computing tools (e.g. Apache Spark, Dask, Ray).
Demonstrated ability to translate research prototypes into reliable, production-grade ML pipelines with rigorous testing and validation.
Experience with ML orchestration tools (e.g. Airflow, Dagster).