ML Engineer (Data), Foundational Models

Sarvam AI

Bengaluru | Models | 3+ yrs

About Sarvam

Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, spanning research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions and is backed by Lightspeed, Peak XV, and Khosla Ventures. Its partners include some of India’s leading brands, among them Tata Capital, SBI Life, CRED, IDFC, and LIC.

About the Role

You will own the data infrastructure that feeds our next family of foundational models. This means building petabyte-scale curation and filtering pipelines, designing the systems that decide what goes into a training run and in what proportion, and treating data quality with the same rigor a research team would bring to an architectural choice.

This is not a glue-code role. The data work at a serious pretraining lab is engineering- and research-heavy: deduplication at scale, quality models, contamination detection, mixture design, curriculum and annealing, attribution and debugging. You should care deeply about all of it.

What You’ll Do

- Design and build large-scale data pipelines for pre-training and post-training — ingestion, parsing, normalization, filtering, deduplication, tokenization, and packing — at petabyte scale.

- Develop and continually improve quality filtering systems, including model-based quality classifiers and contamination detection.

- Own data mixture design, curriculum, and annealing strategies in partnership with the research team. The question "what data did this model see, in what proportion, in what order" should always have a precise answer because of work you did.

- Build the tooling that lets researchers and engineers analyze, slice, attribute, and debug the data.

- Scale the pipeline to handle multilingual corpora, code, math, multi-source web data, and licensed datasets, while keeping provenance and licensing tracked end-to-end.

- Partner with the training infrastructure team so that data is never the bottleneck of a production training run.

What We’re Looking For

- BS or MS in Computer Science or a closely related technical field (or equivalent demonstrated experience).

- 3+ years of experience building large-scale data systems — petabyte-scale processing, distributed data pipelines, or comparable. Exceptional early-career candidates with a strong systems background will be considered.

- Hands-on experience with data curation and filtering for LLM training. You should be able to walk through a pre-training corpus you helped build, end to end, and defend the choices that went into it.

- Deep familiarity with distributed data processing frameworks — Spark, Ray, Beam, Dask, or equivalent — and the storage systems that sit underneath them.

- Strong Python; comfort with the low-level pieces of the data path (tokenization, sharding, packing, IO patterns) and the performance tradeoffs they imply.

- M