Anthropic's mission is to create reliable, interpretable, and steerable AI systems. The AI Reliability Engineering (AIRE) Serving team is responsible for improving the reliability of Anthropic's token path, from client to inference servers and back.

Requirements

Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
Design and implement monitoring systems that track availability, latency, and other salient metrics.
Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads.
Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency.

Benefits

Competitive compensation
Benefits
Optional equity donation matching
Generous vacation and parental leave
Flexible working hours
Lovely office space

Senior Software Engineer, AI Reliability Engineering
