Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform — and operational excellence is at the heart of that mission. As a Site Reliability Engineer focused on Operational Excellence, you will help ensure the stability, resilience, and performance of Crusoe’s GPU cloud.
Requirements
- 5+ years of experience in cloud operations, SRE, or related roles
- Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
- Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
- Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
- Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
- Basic scripting and automation experience (Go, Python, C, C++, or similar)
- Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
- Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
- A growth mindset with enthusiasm for operational excellence, reliability engineering, and continuous improvement
Benefits
- Industry-competitive pay
- Restricted Stock Units in a fast-growing, well-funded technology company
- Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
- Employer contributions to HSAs
- Paid parental leave
- Paid life insurance, short-term and long-term disability
- Teladoc
- 401(k) with a 100% match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal
- Company-paid commuter benefit of $300 per month