CoreWeave is seeking a Senior Site Reliability Engineer, Data Infrastructure, to own the reliability and performance of its Kubernetes-based data platform. The role involves designing and operating highly available, multi-region systems that meet strict uptime and latency targets. The team builds and operates the foundational systems that power data ingestion, transformation, analytics, and internal AI workloads at scale.
Requirements
- 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering roles
- Deep expertise in Kubernetes and containerized software services, including cluster design, operations, and troubleshooting in production environments
- Strong experience building and operating CI/CD systems, including tools such as Argo CD and GitHub Actions
- Proven experience owning production systems with high availability requirements (≥99.99% uptime), including incident response, SLI/SLO/SLA definition, error budgets, and postmortems
- Hands-on experience designing and operating geo-replicated, multi-region, active-active systems, including traffic routing, failover strategies, and data consistency tradeoffs
- Strong experience building and owning observability components, including metrics, logging, and tracing (e.g., Prometheus, Grafana, OpenTelemetry)
Benefits
- Medical, dental, and vision insurance - 100% paid for by CoreWeave
- Company-paid Life Insurance
- Voluntary supplemental life insurance
- Short-term and long-term disability insurance
- Flexible Spending Account
- Health Savings Account
- Tuition Reimbursement
- Ability to participate in the Employee Stock Purchase Program (ESPP)
- Mental Wellness Benefits through Spring Health
- Family-Forming support provided by Carrot
- Paid Parental Leave
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our office and data center locations