As an Applied Research Engineer at Snorkel AI, you will own the infrastructure that powers our model training and evaluation work. This is a hands-on role where you will build and operate GPU cluster infrastructure, training pipelines, and the tooling that allows our research and engineering teams to run experiments reliably and at scale.
Requirements
- Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS with SageMaker HyperPod) for distributed model training, including networking, provisioning, and cost tracking.
- Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters.
- Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale.
- Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration.
- Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures.
- Work closely with research scientists and ML engineers to understand requirements, unblock experiments, and evolve infrastructure as our training workload needs change.
Benefits
- Generous Paid Time Off
- 401(k) Matching
- Retirement Plan
- Visa Sponsorship