We're seeking a Site Reliability Engineer to join our Data & AI Platform Engineering team and help keep our data and AI platforms running reliably, efficiently, and securely. The ideal candidate has hands-on experience with Google Cloud Platform (GCP), infrastructure-as-code tools, and observability tooling.
Responsibilities
- Monitor production systems using observability tooling to detect and triage issues
- Participate in on-call rotations and respond to incidents
- Contribute to blameless post-mortems and documentation
- Maintain and improve SLO dashboards and alerting thresholds
- Identify repetitive manual tasks and build automation to eliminate them
- Write and maintain scripts and tooling to improve deployment reliability
- Operate and maintain workloads running on GCP
- Apply infrastructure-as-code practices to manage infrastructure changes consistently
- Maintain working knowledge of both GCP and Azure to support multi-cloud operations
- Adhere to data security and governance policies
Benefits
- Generous Paid Time Off
- 401(k) Matching
- Retirement Plan