We're looking for a Training Cluster Engineer to join our team and design, deploy, and maintain large-scale ML training clusters. The ideal candidate has production experience managing SLURM clusters at scale and hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments.
Responsibilities
- Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
- Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows
- Partner with cloud and colocation providers to ensure cluster availability and performance
- Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
- Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
- Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning
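To give a flavor of the health-monitoring work above, here is a minimal, hypothetical sketch of a node triage helper. It parses `sinfo -N -h -o "%N %t"` style output (node name and state) and flags nodes in unhealthy SLURM states; the function and state list are illustrative assumptions, not an existing tool.

```python
# Hypothetical sketch: flag SLURM nodes whose state suggests they need attention.
# Input format mirrors `sinfo -N -h -o "%N %t"`: one "node state" pair per line.

UNHEALTHY_STATES = {"down", "drain", "drng", "fail", "maint"}

def triage(sinfo_output: str) -> list[str]:
    """Return node names whose SLURM state indicates a problem."""
    flagged = []
    for line in sinfo_output.strip().splitlines():
        node, state = line.split()
        # SLURM appends '*' to non-responding nodes; strip it before matching.
        if state.rstrip("*").lower() in UNHEALTHY_STATES:
            flagged.append(node)
    return flagged

sample = """\
gpu-001 idle
gpu-002 alloc
gpu-003 down*
gpu-004 drain
"""
print(triage(sample))  # flags gpu-003 and gpu-004
```

In practice a probe like this would feed an automated recovery workflow, e.g. draining the flagged nodes via `scontrol` and opening a ticket with the provider.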
Benefits
- Generous Paid Time Off
- 401(k) Retirement Plan with Matching
- Health Insurance
- Paid Holidays