Temporal is seeking a Staff Software Engineer - Reliability to own the reliability of Temporal Cloud operations end to end, working with engineering, infrastructure, and product teams.
Responsibilities
- Own end-to-end reliability outcomes for Temporal Cloud operations, partnering across engineering, infrastructure, and product to drive measurable improvements.
- Define, implement, and evolve reliability targets and associated practices, including alerting thresholds, operational readiness criteria, and escalation paths.
- Plan and run gamedays to validate incident response, operational procedures, and cross-team coordination under realistic failure scenarios.
- Build and scale a chaos testing program that exercises failure modes safely and drives remediation work that reduces real risk.
- Define and maintain a reliability scorecard across services and key operational processes, and use it to prioritize reliability investments.
- Lead load testing and performance testing efforts, including test design, tooling, and analysis of bottlenecks and capacity constraints.
- Improve observability standards (metrics, logs, traces, dashboards) so reliability signals are consistent, actionable, and easy to audit.
- Drive post-incident learning and corrective actions, ensuring fixes are durable and reduce recurrence risk over time.
- Make system-level tradeoffs across reliability, performance, cost, and velocity, and document decisions clearly for long-term maintainability.
- Mentor other engineers and raise the bar on reliability engineering practices across teams.
Benefits
- Unlimited PTO
- 12 Holidays + 2 Floating Holidays
- 100% Premium Coverage for Medical, Dental, and Vision
- AD&D, Long-Term and Short-Term Disability, and Life Insurance (Standard and Supplemental Available)
- Empower 401(k) Plan