Data Engineer (Senior)

SG, Singapore

Job Description

Location
Singapore
Full time
Non-Remote
No Visa Sponsored

Who Are We Looking For:


---------------------------


We are seeking an experienced Data Engineer (Senior) to build and maintain the data infrastructure that converts our research into scalable, production-ready solutions for synthetic tabular data generation. You will also architect and operate our large-scale data curation, scraping, and cleaning pipelines to deliver massive datasets for pretraining and finetuning large language models on tabular and unstructured domains.

This is an individual contributor (IC) role suited for someone who thrives in a fast-paced, early-stage start-up environment. The ideal candidate has experience scaling data and machine learning systems to handle datasets with billions of records and can build and optimize complex data pipelines for enterprise applications. You'll work closely with software, machine learning, and applied research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands, and more.

Key Responsibilities:


-------------------------

# Data Infrastructure and Pipeline Development




Build data ingestion pipelines from enterprise relational databases (e.g. Oracle, SQL Server, PostgreSQL, MySQL, Databricks, Snowflake, BigQuery) and files (e.g. Parquet, CSV) for large-scale synthetic data pipelines.

Design scalable data pipelines for batch processing.



Architect and maintain data warehouses and data lakes (e.g. Delta Lake) optimized for synthetic data training and generation workflows.


Seamlessly transform Pandas-based research code into production-ready pipelines.



Build automated data quality monitoring and validation systems to ensure data integrity throughout the pipeline lifecycle.

Implement comprehensive data lineage tracking and audit capabilities for regulatory compliance and privacy validation.


Design robust error handling mechanisms, with automatic retries and data recovery in case of pipeline failures.


Track performance metrics such as data throughput, latency, and processing times to ensure efficient pipeline operations at scale.


Implement monitoring and alerting (e.g. Prometheus, Grafana) for pipeline health, throughput, and data quality metrics.


Optimize resource allocation and cost efficiency for distributed processing at terabyte to petabyte scale.

# Massive-Scale Data Collection & Ingestion




Design and build distributed web scraping clusters to extract data from millions of pages.


Build LLM-aided data filtering systems that use automated model scoring to evaluate and prioritize high-quality content.

# Understanding of ML Concepts and Algorithms




A fair understanding of machine learning concepts, training workflows, and algorithms, plus familiarity with tools such as PyTorch and Hugging Face.

# Documentation & Reporting




Create clear documentation of data pipelines, workflows, and system architectures to enable smooth handovers and collaboration across teams.

Qualifications


-----------------------


Bachelor's degree in Computer Science, Software Engineering, Data Engineering, or a related field, with a strong foundation in distributed systems and data processing.


Expert proficiency in scaling data pipelines and machine learning systems to handle billions of rows in enterprise environments.


3+ years of experience building scalable data solutions with Python and libraries such as:


Data Science Libraries: Pandas, NumPy, Scikit-learn.


Deep Learning Libraries: PyTorch


Scaling Libraries: Spark, Dask, etc.


Orchestration tools: Airflow, Dagster, etc.


Data validation: Pandera, Pydantic, etc.


Expertise in automated data quality frameworks, including rule-based and AI-based automation for format validation, anomaly detection, and statistical validation.




Proficiency in building ETL/ELT pipelines and managing data across relational databases (e.g. PostgreSQL, Oracle Database, SQL Server, MySQL), data lakes (e.g. Delta Lake), and cloud storage.


Experience in building data monitoring and alerting systems.




Hands-on experience with web scraping tools (Scrapy, Selenium, Puppeteer).

Experience building ML data pipelines and supporting infrastructure for training and deploying machine learning models at scale.

# Good to Have



Experience with data governance frameworks and compliance requirements (GDPR, CCPA, PDPA) in data processing systems.

Experience with containerization and orchestration using Docker, Kubernetes, and cloud-native deployment strategies.

Strong knowledge of cloud platforms (AWS, GCP, Azure) and their data services (S3, BigQuery, Data Lake Storage, etc.).

Why Join Us:


----------------


This is a unique opportunity for someone looking to actively build and scale systems in a fast-moving start-up. If you've successfully scaled machine learning and data systems to billions of rows and thrive in a dynamic, hands-on environment, this role is for you.

Benefits:


-------------


Flexible time-off arrangements


Flexible work arrangements - work from office at One North or WFH on some days


Equity eligibility: Competitive equity packages, with grant size evaluated based on the candidate's experience, skills, and impact.

How to apply:


-----------------


Does this role sound like a good fit for you?

We see this first:

Submit your application



We see this last:

If the above does not work, you may email us your CV (PDF format) at jobs@betterdata.ai.


Include the title of the role in your subject line


Indicate your available start - end dates (DDMMYY - DDMMYY)


Send along links/supporting information that best showcase the relevant things you have built and done


Job Detail

  • Job Id
    JD1661689
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type:
    Full Time
  • Salary:
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    SG, Singapore
  • Education
    Not mentioned