We're looking for a Training Cluster Engineer to join our team and design, deploy, and maintain large-scale ML training clusters. The ideal candidate has production experience managing SLURM clusters at scale and hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments.
Responsibilities
- Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
- Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows
- Partner with cloud and colocation providers to ensure cluster availability and performance
- Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
- Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
- Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning
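To give a flavor of the health-monitoring work above, here is a minimal, hypothetical sketch of a node triage helper. It parses `sinfo -N -h -o "%N %t"` style output (node name and state) and flags nodes in unhealthy SLURM states; the function and state list are illustrative assumptions, not an existing tool.

```python
# Hypothetical sketch: flag SLURM nodes whose state suggests they need attention.
# Input format mirrors `sinfo -N -h -o "%N %t"`: one "node state" pair per line.

UNHEALTHY_STATES = {"down", "drain", "drng", "fail", "maint"}

def triage(sinfo_output: str) -> list[str]:
    """Return node names whose SLURM state indicates a problem."""
    flagged = []
    for line in sinfo_output.strip().splitlines():
        node, state = line.split()
        # SLURM appends '*' to non-responding nodes; strip it before matching.
        if state.rstrip("*").lower() in UNHEALTHY_STATES:
            flagged.append(node)
    return flagged

sample = """\
gpu-001 idle
gpu-002 alloc
gpu-003 down*
gpu-004 drain
"""
print(triage(sample))  # flags gpu-003 and gpu-004
```

In practice a probe like this would feed an automated recovery workflow, e.g. draining the flagged nodes via `scontrol` and opening a ticket with the provider.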
Benefits
- Generous Paid Time Off
- 401(k) Retirement Plan with Matching
- Health Insurance
- Paid Holidays