As a Site Reliability Engineer, you will help ensure the reliability, scalability, and observability of CloudBlue's multi-tenant SaaS platforms used by service providers worldwide. You will focus on improving system stability and performance through monitoring, high availability, and incident response, while working closely with DevOps, Platform, and Engineering teams to build and operate resilient production systems.

Requirements

3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
Solid understanding of Linux, networking, and distributed systems fundamentals
Experience working with containerized environments such as Docker and Kubernetes
Strong scripting and automation skills using Python and/or Bash
Experience participating in on-call rotations and incident response in production environments
Strong written and spoken English
Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
Exposure to hyperscale or service-provider-grade platforms is an advantage
Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
Experience working with hybrid or on-premises integrations is beneficial
Familiarity with chaos engineering and resilience testing will be considered an asset

Benefits

Work from anywhere - this is a remote opportunity
A competitive salary that values you and your unique skill sets
Career advancement & professional development opportunities to help you reach your full potential
Flexible work arrangements to support work/life balance
