We are seeking an experienced Site Reliability Engineer / Platform Engineer to join our team and help build and maintain the resilient, scalable infrastructure that supports our applications across multiple cloud providers.
Responsibilities
- Design, build, and maintain infrastructure across AWS, GCP, and Azure using Infrastructure as Code (IaC) principles
- Implement and optimize CI/CD pipelines using tools such as Argo and CircleCI to enable rapid, reliable deployments
- Manage and scale Kubernetes clusters in production environments, ensuring high availability and optimal resource utilization
- Administer and optimize cloud databases and other data stores, including MongoDB, Redis, and Amazon RDS, for performance and reliability
- Develop monitoring, alerting, and observability solutions to identify and resolve issues before they impact users
- Automate routine operational tasks to reduce manual toil and improve system reliability
- Participate in incident response and conduct post-mortem analyses to drive continuous improvement
- Collaborate with development teams to design systems with reliability, scalability, and operational excellence in mind
- Document infrastructure architecture, runbooks, and operational procedures
- Evaluate and implement new tools and technologies to improve platform capabilities
Benefits
- Competitive salary
- Comprehensive benefits package
- Opportunity to work with cutting-edge cloud technologies and tools
- Collaborative environment focused on knowledge sharing and professional growth
- Remote or flexible work arrangements
- Continuous learning and development opportunities