Cloudbeds is seeking a Senior Site Reliability Engineer to join its remote team. The ideal candidate will have 5+ years of experience as an SRE or Systems Engineer working with AWS cloud infrastructure, and 3+ years of production experience with Kubernetes, Docker, and Helm charts at scale. The role involves designing and implementing reliable, scalable, and efficient cloud infrastructure to support the growth of Cloudbeds' global platform.
Responsibilities
- Design and implement reliable, scalable, and efficient cloud infrastructure to support the growth of Cloudbeds' global platform
- Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components
- Develop and continuously improve monitoring and logging systems using the Prometheus, Datadog, and Loki stacks
- Participate in the on-call rotation to support the production environment and ensure rapid response to outages
- Lead incident response efforts, ensuring minimal service impact while documenting learnings and implementing preventive measures
- Collaborate with development teams to establish Service Level Objectives (SLOs) and ensure systems meet or exceed reliability targets
- Champion SRE best practices across engineering, mentoring teams on resiliency, performance optimization, and scalability
- Automate platform operations with infrastructure-as-code (Terraform) and configuration management tools
Benefits
- Remote First, Remote Always
- PTO in accordance with local labor requirements
- Free use of 2 corporate apartments for team members (San Diego & São Paulo)
- Full Paid Parental Leave
- Home office stipend based on country of residency
- Professional development courses through Cloudbeds University
- Access to professional therapy and coaching
- Access to professional development, including manager training, upskilling, and knowledge transfer