This position aims to improve the reliability and stability of systems in both cloud and on-premises environments, from both the operations and engineering perspectives.
Requirements
- Design and improve reliability and stability
  - Design and improve availability, performance, and capacity
  - Define and monitor operational indicators (SLI/SLO equivalent)
  - Analyze root causes and design/implement permanent countermeasures
- Automate operations and reduce toil
  - Identify and quantify manual operations
  - Automate runbooks and codify operations
  - Use IaC to standardize environments and improve their reproducibility
- Enhance monitoring and observability
  - Design monitoring (metrics, logs, alarms)
  - Reduce alarm noise and design meaningful notifications
  - Shorten the lead time from fault detection to recovery
- Collaborate across Dev and Ops
  - Collaborate with the development team to stabilize releases
  - Provide design feedback from the operational perspective
  - Improve the integration points between CI/CD and operations
Benefits
- Flexible work arrangements
- Face-to-face collaboration opportunities
- Inclusive environment