Role Overview

Design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems.

What You Will Do

Design and evolve multi-provider, multi-region GPU compute clusters
Serve as primary technical point of contact for customers
Define SLOs and error budgets
Ensure health and performance of high-speed interconnects
Build deep visibility into GPU utilization
Lead incident response for complex failures

Why It Might Be a Fit

You will have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute.

Requirements

Deep, hands-on experience operating large-scale GPU clusters
Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
Working knowledge of how large training jobs actually run — NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
Expert-level Linux knowledge: kernel tuning, driver management, cgroup/namespace internals, performance profiling at the syscall and hardware level
Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators
Strong engineering skills in Python, Go, or Bash, and Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent)
Hands-on experience building monitoring and alerting for GPU infrastructure, and a proven track record of leading incident response for complex distributed systems

Benefits

Equal opportunity employer
Inclusive environment for all employees
Autonomy to shape how systems run at a foundational level
Significant ownership and impact
Opportunity to work directly with customers and providers

Senior Site Reliability Engineer - AI Infrastructure

About Andromeda

Andromeda