Role Overview

The Senior Incident Manager is responsible for leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and data center services. This individual acts as the central command point during major incidents, ensuring rapid triage, cross-team coordination, effective communication, and structured post-incident analysis.

What You Will Do

Lead the response to critical incidents, serve as the Incident Commander, establish clear incident timelines, and maintain incident response documentation.

Why It Might Be a Fit

This role requires deep operational expertise in high-availability infrastructure, large-scale GPU clusters, networking, and cloud platforms, along with strong leadership and communication skills.

Requirements

8+ years of experience in incident management, site reliability engineering, or infrastructure operations
Experience managing incidents in large-scale distributed infrastructure environments
Strong understanding of data center operations, GPU compute clusters, networking and storage infrastructure, and cloud or hybrid infrastructure platforms
Proven ability to lead high-pressure incident response situations
Experience with incident management frameworks (ITIL, SRE, or equivalent)
Excellent communication and stakeholder management skills
Experience with incident tracking and monitoring tools such as PagerDuty, ServiceNow, Jira, Datadog, and Prometheus/Grafana

Benefits

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan

Senior Incident Manager

About Lambda

Lambda