As a Sr. Site Reliability Engineer, you'll be the guardian of our platform's reliability and performance, ensuring millions of hospitality transactions flow seamlessly across the globe. You'll architect and implement scalable AWS cloud solutions that keep the most ambitious hotels running 24/7, while fostering a culture of automation, resilience, and continuous improvement across our engineering teams.
Requirements
- Design and implement reliable and scalable AWS architecture to meet the needs of the organization.
- Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components.
- Support the CI/CD process with ArgoCD and GitOps.
- Automate the platform deployments with Terraform infrastructure-as-code.
- Develop and continuously improve product observability and monitoring systems based on Grafana, Prometheus, Datadog, and CloudWatch.
- Respond to incidents and participate in Incident Management and Root Cause Analysis, ensuring minimal impact on services.
- Optimize system performance and troubleshoot issues as they arise.
- Collaborate with development teams to establish monitoring best practices and ensure systems meet reliability targets.
- Collaborate with security teams to implement and maintain security best practices.
- Participate in the infrastructure support rotation, providing guidance to other engineering teams.
Benefits
- Remote First, Remote Always
- PTO in accordance with local labor requirements
- Monthly Wellness Fridays: enjoy an extra-long weekend every month
- Fully Paid Parental Leave
- Home office stipend based on country of residency
- Professional development courses through Cloudbeds University
- Access to professional development, including manager training, upskilling, and knowledge transfer