Location: Singapore
Full time | Non-Remote | No Visa Sponsorship
Who We Are Looking For:
---------------------------
We are seeking an experienced Data Engineer (Senior) to build and maintain data infrastructure that converts our research into scalable, production-ready solutions for synthetic tabular data generation. You will also architect and operate our large-scale data curation, scraping, and cleaning pipelines to deliver massive datasets for pretraining and finetuning large language models on tabular and unstructured domains.

This is an individual contributor (IC) role suited to someone who thrives in a fast-paced, early-stage start-up environment. The ideal candidate has experience scaling data and machine learning systems to handle datasets with billions of records and can build and optimize complex data pipelines for enterprise applications. You'll work closely with software, machine learning, and applied research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands, and more.
Key Responsibilities:
-------------------------
# Data Infrastructure and Pipeline Development
Build data ingestion pipelines from enterprise relational databases (e.g. Oracle, SQL Server, PostgreSQL, MySQL, Databricks, Snowflake, BigQuery) and files (e.g. Parquet, CSV) for large-scale synthetic data pipelines.
Design scalable data pipelines for batch processing.
Architect and maintain data warehouses and data lakes (e.g. Delta Lake) optimized for synthetic data training and generation workflows.
Seamlessly transform Pandas-based research code into production-ready pipelines.
Build automated data quality monitoring and validation systems to ensure data integrity throughout the pipeline lifecycle.
Implement comprehensive data lineage tracking and audit capabilities for regulatory compliance and privacy validation.
Design robust error handling mechanisms, with automatic retries and data recovery in case of pipeline failures (see the sketch after this list).
Track performance metrics such as data throughput, latency, and processing times to ensure efficient pipeline operations at scale.
Implement monitoring and alerting (e.g. Prometheus, Grafana) for pipeline health, throughput, and data quality metrics.
Optimize resource allocation and cost efficiency for distributed processing at terabyte-to-petabyte scale.
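To give candidates a concrete flavour of the data-quality and retry work described in the list above, here is a minimal sketch of a validated ingestion step. It assumes Pandas and Pandera; the schema, column names, file path, and backoff policy are illustrative placeholders, not our production code.

```python
# Minimal sketch (illustrative only): validate an ingested batch with Pandera
# and retry transient I/O failures with exponential backoff.
import time

import pandas as pd
import pandera as pa

# Placeholder schema; real schemas are defined per customer dataset.
batch_schema = pa.DataFrameSchema({
    "customer_id": pa.Column(str, nullable=False),
    "amount": pa.Column(float, checks=pa.Check.ge(0)),
})

def ingest_with_retries(path: str, max_attempts: int = 3) -> pd.DataFrame:
    """Load one Parquet batch and validate it, retrying transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            df = pd.read_parquet(path)
            return batch_schema.validate(df)  # raises pandera.errors.SchemaError on bad data
        except (IOError, OSError):
            if attempt == max_attempts:
                raise                          # surface the failure to monitoring/alerting
            time.sleep(2 ** attempt)           # simple exponential backoff
```

In practice, retries and alerting would usually be delegated to the orchestrator, and validation results would feed the throughput, latency, and data quality metrics mentioned above.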
# Massive-Scale Data Collection & Ingestion
Design and build distributed web scraping clusters to extract data from millions of pages (see the sketch after this list).
Build LLM-aided data filtering systems that use automated model scoring to evaluate and prioritize high-quality content.
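As a rough illustration of the scraping work referenced in the first item of this list, the sketch below shows a single, minimal Scrapy spider; the seed URL and CSS selectors are hypothetical, and a real cluster would add distributed scheduling, deduplication, and politeness controls.

```python
# Minimal sketch (illustrative only): a single-node Scrapy spider with placeholder
# URL and selectors; the distributed, cluster-scale version builds on this pattern.
import scrapy

class TablePageSpider(scrapy.Spider):
    name = "table_pages"
    start_urls = ["https://example.com/datasets"]  # placeholder seed URL

    def parse(self, response):
        # Emit one record per table row found on the page.
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}

        # Follow pagination links so the crawl can fan out across many pages.
        for href in response.css("a.next::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Such a spider can be run locally with `scrapy runspider spider.py`; scoring and filtering of the scraped content with LLM-based quality signals happens downstream.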
# Understanding of ML Concepts and Algorithms
A fair understanding of machine learning concepts, training workflows, and algorithms, including familiarity with tools like PyTorch and Hugging Face.
# Documentation & Reporting
Create clear documentation of data pipelines, workflows, and system architectures to enable smooth handovers and collaboration across teams.
Qualifications
-----------------------
Bachelor's degree in Computer Science, Software Engineering, Data Engineering, or a related field, with a strong foundation in distributed systems and data processing.
Expert proficiency in scaling data pipelines and machine learning systems to handle billions of rows in enterprise environments.
3+ years of experience building scalable data solutions with Python and its ecosystem of libraries, such as:
Data science libraries: Pandas, NumPy, Scikit-learn
Deep learning libraries: PyTorch
Scaling libraries: Spark, Dask, etc.
Orchestration tools: Airflow, Dagster, etc.
Data validation: Pandera, Pydantic, etc.
Expertise in automated data quality frameworks, including rule-based and AI-based automation for format validation, anomaly detection, and statistical validation.
Hands-on experience with web scraping tools (Scrapy, Selenium, Puppeteer).
Experience building ML data pipelines and supporting infrastructure for training and deploying machine learning models at scale (a minimal orchestration sketch follows this list).
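To make the orchestration and ML-pipeline expectations above more concrete, here is a minimal sketch of the kind of DAG this role would own, written against the Airflow 2.x TaskFlow API; the task names, schedule, and paths are illustrative assumptions only.

```python
# Minimal sketch (illustrative only): an Airflow 2.x TaskFlow DAG wiring
# extraction, validation, and loading with per-task retries.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def synthetic_tabular_ingest():
    @task(retries=3)
    def extract() -> str:
        # Placeholder: pull a batch from a source database and stage it as Parquet.
        return "/tmp/batch.parquet"

    @task
    def validate(path: str) -> str:
        # Placeholder: run schema and statistical checks (e.g. with Pandera) on the batch.
        return path

    @task
    def load(path: str) -> None:
        # Placeholder: write the validated batch into the data lake (e.g. Delta Lake).
        pass

    load(validate(extract()))

synthetic_tabular_ingest()
```

The same structure maps onto Dagster or other orchestrators; the point is dependency-aware scheduling, retries, and observability around each step.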
# Good to Have
Experience with data governance frameworks and compliance requirements (GDPR, CCPA, PDPA) in data processing systems.
Experience with containerization and orchestration using Docker, Kubernetes, and cloud-native deployment strategies.
Strong knowledge of cloud platforms (AWS, GCP, Azure) and their data services (S3, BigQuery, Data Lake Storage, etc.).
Why Join Us:
----------------
This is a unique opportunity for someone looking to actively build and scale systems in a fast-moving start-up. If you've successfully scaled machine learning and data systems to billions of rows and thrive in a dynamic, hands-on environment, this role is for you.
Benefits:
-------------
Flexible time-off arrangements
Flexible work arrangements - work from the office at One North or from home on some days
Equity eligibility: Competitive equity packages, with grant size evaluated based on the candidate's experience, skills, and impact.
How to apply:
-----------------
Does this role sound like a good fit for you?
We see this first: Submit your application
We see this last: If the above does not work, you may email us your CV (PDF format) at jobs@betterdata.ai.
Include the title of the role in your subject line
Indicate your available start - end dates (DDMMYY - DDMMYY)
Send along links or supporting information that best showcase the relevant things you have built and done