We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems.
Requirements
- 5+ years of experience in ML systems, infra, or distributed training
- Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
- Strong software engineering fundamentals (Python, systems design, testing)
- Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/Gloo)
- Ability to implement algorithms across GPUs/nodes based on mathematical specs
- Experience working on an ML platform, infrastructure, or distributed inference optimization team
- Experience with large-scale machine learning workloads and strong ML fundamentals
Benefits
- Comprehensive medical, dental, and vision
- 401(k) program
- Generous PTO, sick leave, and holidays
- Paid parental leave and family-friendly benefits
- On-site amenities and perks: complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station