Architect, On-Device Inference
Sarvam AI
About Sarvam
Sarvam is building the bedrock of Sovereign AI for India: a full-stack sovereign AI platform spanning research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions, partners with India’s leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC, and is backed by Lightspeed, Peak XV, and Khosla Ventures.
About the Role
Own the technical architecture of Sarvam’s on-device products end to end, from model export and chipset-specific runtime selection up through the OS-layer voice input integration on Windows, macOS, and Android. You set the standards every other engineer on the team works against. You are the technical interface to OEM partners (Qualcomm, Intel, NVIDIA, AMD, Apple). You are accountable for hitting the published latency, footprint, and accuracy targets across every supported chipset.
This is a player-coach role. We expect you to write code, debug at the kernel/driver layer when needed, and review every workbook before publication, but your highest-leverage work is in setting the architecture and unblocking the team.
What You’ll Do
- End-to-end latency and footprint budgets across all targets: memory and runtime SLAs on NPUs, CPUs, and GPUs.
- Runtime selection strategy per chipset: when OpenVINO vs. ONNX Runtime+EP, when TensorRT vs. CUDA-direct, when QNN vs. LiteRT, when CoreML vs. CPU fallback. The decision matrix is your deliverable.
- Model export pipeline: how models go from PyTorch to every target runtime, with shared infrastructure where possible and per-runtime customization where necessary.
- xPU selector and graceful-degradation logic: probing host capabilities, driver-version compatibility, fallback paths when the user picks an unavailable backend.
- OEM technical relationships: you are the technical face to Qualcomm, Intel, NVIDIA, AMD, and Apple counterparts. You explain perf wins and losses, escalate driver issues, and influence their roadmaps where we have leverage.
- Tech Lead duties for the optimization team: hiring bar, technical roadmap, weekly architecture reviews, mentorship of the three senior engineers.
What We're Looking For
- 8+ years on ML systems, with at least 3 years shipping on-device inference in production. Resume should show models actually running on user devices, not just internal demos.
- Genuine depth in at least three of: TensorRT/CUDA, OpenVINO, QNN/SNPE, CoreML/ANE, ONNX Runtime EP development, llama.cpp/MLC. Reading-level fluency in the rest.
- Production experience with streaming inference - KV cache management, chunked attention, encoder-decoder cache projection, partial output emission. ASR or LLM streaming both qualify.
- Has shipped against hard latency budgets (sub-second E2E) on heterogeneous hardware. Knows where the time actually goes - capture, preprocessing, model, post-proces