We are hiring a hands-on Senior Site Reliability Engineer (SRE) to build, operate, and improve the reliability of our production systems. The role is open to US-based candidates and can be hybrid or fully remote.
Responsibilities
- Design, build, and operate highly available, scalable systems in AWS
- Write, maintain, and review Terraform to provision and manage infrastructure
- Own and improve monitoring, alerting, and observability using Grafana, Pingdom, and Uptrends
- Participate in a rotating on-call schedule, responding to production incidents and driving issues to resolution
- Lead incident response, root cause analysis, and post-incident reviews with a focus on prevention and automation
- Define and manage SLOs, SLIs, and error budgets
- Build and improve CI/CD pipelines and operational workflows using Azure DevOps and GitHub
- Work directly with application teams to improve reliability, performance, and deployability
- Automate manual operational tasks to reduce toil
- Maintain clear, actionable runbooks and documentation in Confluence
- Track work, incidents, and operational improvements using Jira and ServiceNow
- Mentor other engineers and help set SRE standards and best practices
Benefits
- Competitive salary
- Comprehensive benefits
- Flexible work location with hybrid or fully remote options
- Real ownership of production systems and reliability outcomes
- A culture that values automation, learning, and continuous improvement