Senior Site Reliability Engineer responsible for owning the reliability, scalability, and operational excellence of critical enterprise platforms and shared capabilities.
Requirements
- Design, implement, and operate reliable shared services platforms aligned to TR Technology standards, acting as the key point of escalation for any production-related incidents.
- Participate in on-call/shift rotations (L2).
- Own service reliability outcomes, including availability, performance, latency, and capacity.
- Implement site reliability engineering and DevOps best practices.
- Feed non-functional requirements into the product backlog, including but not limited to high availability, scalability, self-healing, observability, and security.
- Apply advanced monitoring, alert correlation, and root cause analysis techniques: build and maintain monitoring and alerting for all aspects of the infrastructure, microservices, and the platform.
- Act as Incident Commander for high-severity incidents when required: troubleshoot and monitor until successful mitigation, communicate effectively, run the postmortem, and implement the learnings.
- Apply AI/ML-driven reliability and operational practices, including experience with AI-powered monitoring, anomaly detection, incident triage, and predictive system analysis.
- Collaborate with engineering and platform teams to integrate AI-based automation into CI/CD, infrastructure management, and incident response workflows.
- Focus on continuous improvement and technical standards: drive improvements in productivity, monitoring, and tooling, and set industry best practices.
Benefits
- Flexible vacation
- Two company-wide Mental Health Days off
- Access to the Headspace app
- Retirement savings
- Tuition reimbursement
- Employee incentive programs
- Resources for mental, physical, and financial wellbeing