CoreWeave is seeking an Operations Engineer, Fleet Reliability, to join its team. The ideal candidate will have a strong understanding of Linux system administration, skill in troubleshooting hardware and software issues, and experience with data center environments. They will be responsible for configuring and maintaining large-scale, high-performance supercomputing clusters, resolving hardware and software issues, and monitoring system performance.
Requirements
- Strong understanding of Linux system administration and internals
- Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
- Software development or scripting languages (Bash, Python, PowerShell, etc.)
- 2+ years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network, or a mix)
- Grafana, Prometheus, PromQL queries, or similar observability platforms
- Data center environments, including server racks, HVAC systems, and fiber trays
- Kubernetes administration
- HPC - administering GPU-related workloads
- Bachelor’s degree in a related field or equivalent experience
Benefits
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short- and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Ability to participate in the Employee Stock Purchase Program (ESPP)
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations