Lambda is seeking a Senior Site Reliability Engineer to build the world's best deep learning cloud. The ideal candidate will have 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role, and a strong understanding of modern AI infrastructure and Linux-based systems.
Responsibilities
- Define Fleet Health metrics and indicators to objectively measure and improve system availability
- Collaborate with the observability team on comprehensive monitoring and alerting systems
- Create runbooks and automated remediations for common failure scenarios
- Implement and integrate logging and metrics across platforms
- Participate in on-call rotations and provide support for incident response and resolution
Benefits
- Generous cash & equity compensation
- Health, dental, and vision coverage for you and your dependents
- Commuter or work-from-home stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible Paid Time Off Plan