Cloudbeds is seeking a Senior Site Reliability Engineer to join its remote team. The ideal candidate has 5+ years of experience as an SRE or Systems Engineer working with AWS cloud infrastructure and 3+ years of production experience with Kubernetes, Docker, and Helm charts at scale. The role involves designing and implementing reliable, scalable, and efficient cloud infrastructure to support the growth of Cloudbeds' global platform.

Requirements

Design and implement reliable, scalable, and efficient cloud infrastructure to support the growth of Cloudbeds' global platform
Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components
Develop and continuously improve monitoring and logging systems using Prometheus, Datadog, and Loki stacks
Participate in the on-call rotation to support the production environment and ensure rapid response to outages
Lead incident response efforts, ensuring minimal service impact while documenting learnings and implementing preventive measures
Collaborate with development teams to establish Service Level Objectives (SLOs) and ensure systems meet or exceed reliability targets
Champion SRE best practices across engineering, mentoring teams on resiliency, performance optimization, and scalability
Automate platform operations with infrastructure-as-code (Terraform) and configuration management tools

Benefits

Remote First, Remote Always
PTO in accordance with local labor requirements
Two corporate apartments available to team members free of charge (San Diego & São Paulo)
Fully paid parental leave
Home office stipend based on country of residency
Professional development courses in Cloudbeds University
Access to professional therapy and coaching
Access to professional development, including manager training, upskilling, and knowledge transfer
