Anthropic's mission is to create reliable, interpretable, and steerable AI systems. The AIRE Serving team is responsible for elevating the reliability of Anthropic's token path from client to inference servers and back.
Requirements
- Develop appropriate Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
- Design and implement monitoring systems that track availability, latency, and other salient metrics.
- Assist in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of millions of external customers and high-traffic internal workloads.
- Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
- Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
- Build and maintain cost optimization systems for large-scale AI infrastructure, focusing on accelerator (GPU/TPU/Trainium) utilization and efficiency.
Benefits
- Competitive compensation
- Benefits
- Optional equity donation matching
- Generous vacation and parental leave
- Flexible working hours
- Lovely office space