Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform — and operational excellence is at the heart of that mission. As a Site Reliability Engineer focused on Operational Excellence, you will help ensure the stability, resilience, and performance of Crusoe’s GPU cloud.

Requirements

5+ years of experience in cloud operations, SRE, or related roles
Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
Basic scripting and automation experience (Go, Python, C, C++, or similar)
Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
A growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement

Benefits

Industry-competitive pay
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement
Subscription to the Calm app
MetLife Legal
Company-paid commuter benefit: $300 per month

Senior+ Site Reliability Engineer

Job Details

About Crusoe