We are seeking a Senior Site Reliability Engineer (Senior SRE) to drive the scalability, reliability, and efficiency of our critical systems and infrastructure. As a senior member of the team, you will lead SRE initiatives, mentor engineers, and architect solutions that enhance system resilience and operational excellence.
Requirements
- Bachelor's or Master's degree in Computer Science or a related field
- 6+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
- Strong experience with cloud platforms (AWS, GCP, or Azure) and cloud-native technologies
- Expertise in Kubernetes and container orchestration
- Expertise with log management tools such as the ELK Stack or Graylog
- Strong coding/scripting skills in Python, Go, or Bash for automation
- Deep understanding of networking, DNS, CDN, load balancing, and security
- Proven experience with observability tools (Prometheus, Grafana, ELK, OpenTelemetry)
- Hands-on experience with performance tuning, high availability, and disaster recovery (DR) strategies
- Strong knowledge of incident management frameworks and reliability metrics (SLOs, SLIs, SLAs)
- Experience leading cross-functional reliability initiatives
Responsibilities
- Implement security best practices, including infrastructure hardening, zero-trust principles, and identity management
- Ensure compliance with SOC 2 and ISO 27001
- Mentor and coach junior SREs, fostering a culture of reliability
- Work closely with development teams to ensure reliability is built into the software lifecycle
- Advocate for chaos engineering, game days, and resilience testing to enhance system robustness