Under limited supervision, the Site Reliability Engineer III is responsible for improving system reliability and resilience by building automation to reduce manual effort and prevent service-impacting incidents.
Responsibilities
- Gathers and analyzes metrics from monitoring platforms to support performance tuning and improve fault tolerance.
- Partners with development teams to improve services through testing and release procedures.
- Uses service-level objectives to balance feature development speed against reliability.
- Works closely with the incident response team to restore service to normal operation.
- Investigates, blocks, and rate-limits unwanted traffic.
- Uses monitoring systems and dashboards to drive proactive changes and alerting.
- Establishes continuous improvement cycles in which processes, performance, and supporting technologies are reviewed and enhanced where applicable.
Benefits
- Healthcare coverage
- 401(k)
- Tuition reimbursement
- Vacation
- Sick leave
- Holiday pay