As an Applied Research Engineer at Snorkel AI, you will own the infrastructure that powers our model training and evaluation work. This is a hands-on role where you will build and operate GPU cluster infrastructure, training pipelines, and the tooling that allows our research and engineering teams to run experiments reliably and at scale.

Requirements

Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS HyperPod) for distributed model training, including networking, provisioning, and cost tracking.
Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters.
Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale.
Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration.
Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures.
Work closely with research scientists and ML engineers to understand requirements, unblock experiments, and evolve the infrastructure as our training workload needs change.

Benefits

Generous Paid Time Off
401k Matching
Retirement Plan
Visa Sponsorship

Applied Research Engineer – Training Infra
