The Frontier Clusters team at OpenAI builds, launches, and supports the largest supercomputers in the world to train the company's most advanced models. The team is looking for a distributed systems engineer with experience managing large-scale compute environments, high-performance clusters, or similar hyperscale infrastructure.
Requirements
- Design, build, bring up, and operate large-scale compute clusters to power advanced AI research
- Write software that orchestrates the largest clusters in the world, manages resource allocation, and automates cluster lifecycle operations
- Have a strong understanding of distributed systems principles and a proven track record of designing scalable, reliable, and secure compute clusters
- Possess strong programming skills, with experience in Python, Go, or similar languages
- Have experience working in public cloud environments (especially Azure)
- Be familiar with high-performance computing, GPU workloads, or AI/ML compute patterns
Benefits
- Relocation assistance
- Hybrid work model (3 days in the office per week)