As a Site Reliability Engineer, you will help ensure the reliability, scalability, and observability of CloudBlue's multi-tenant SaaS platforms used by service providers worldwide. You will focus on improving system stability and performance through monitoring, high availability, and incident response, while working closely with DevOps, Platform, and Engineering teams to build and operate resilient production systems.
Requirements
- 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
- Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
- Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
- Solid understanding of Linux, networking, and distributed systems fundamentals
- Experience working with containers and container orchestration, such as Docker and Kubernetes
- Strong scripting and automation skills using Python and/or Bash
- Experience participating in on-call rotations and incident response in production environments
- Strong written and spoken English
- Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
- Exposure to hyperscale or service-provider-grade platforms is an advantage
- Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
- Experience working with hybrid or on-premises integrations is beneficial
- Familiarity with chaos engineering and resilience testing will be considered an asset
Benefits
- Work from anywhere - this is a remote opportunity
- A competitive salary that reflects your unique skill set
- Career advancement & professional development opportunities to help you reach your full potential
- Flexible work arrangements to support work/life balance