We’re looking for seasoned ML infrastructure engineers to design, build, and maintain training and serving infrastructure for ML research. The ideal candidate has experience developing tools to diagnose ML infrastructure problems and failures, working with cloud platforms, and maximizing GPU allocation and utilization.
Requirements
- 4+ years of experience supporting infrastructure in an ML environment
- Experience in developing tools used to diagnose ML infrastructure problems and failures
- Experience with cloud platforms and services (e.g., Compute Engine, Kubernetes, Cloud Storage)
- Experience working with GPUs
- Experience with large GPU clusters and high-performance computing/networking
- Experience supporting large language model training
- Experience with ML frameworks such as PyTorch, TensorFlow, or JAX
- Experience with GPU kernel development