
Job description
The Senior Incident Manager is responsible for leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and data center services. This individual acts as the central command point during major incidents, ensuring rapid triage, cross-team coordination, effective communication, and structured post-incident analysis.
The Senior Incident Manager leads the response to critical incidents, serves as the Incident Commander, establishes clear incident timelines, and maintains incident response documentation.
This role requires deep operational expertise in high-availability infrastructure, large-scale GPU clusters, networking, and cloud platforms, along with strong leadership and communication skills.
Company

Tech, Software & IT Services
Lambda builds and operates a cloud-based superintelligence platform that delivers end-to-end AI infrastructure for deep learning, large language models, and generative AI. Leveraging high-performance GPUs and distributed training, the company enables rapid development and deployment of foundation models across industries. Its focus on scalable AI factories and a unified cloud architecture differentiates Lambda as a leader in accelerating AI adoption. The culture prioritizes collaboration, innovation, and a commitment to responsible AI, making it an attractive environment for talent eager to shape the future of intelligent systems.