TechInsights is building the reliability and AI operations foundation for its next chapter — an AI-first intelligence platform that runs the most demanding semiconductor intelligence workflows in the world. We're looking for a Senior Site Reliability Engineer who wants to own that foundation.
Requirements
- Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
- Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation
- Architect for blast radius containment — agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
- Mature our Canada Central/West active-active architecture toward 24-hour RTO with full regional failover
- Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing
- Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards
- Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation
- Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions) — set standards, optimize deployment frequency, and ensure teams can ship confidently
- Drive internal developer platform (IDP) adoption and enable teams on SRE practices: on-call readiness, SLO definition, runbook development, and self-service tooling
- Represent reliability in architectural discussions; surface risk before it's committed to design
- Own the service catalog — a living inventory of all services, AI agents, dependencies, ownership, and SLOs
- Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
- Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput
- Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement
- Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations
- Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance
- Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale
- Formally mentor junior and intermediate SREs, with accountability for their technical growth and career progression
- Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity
Benefits
- Company-sponsored training and development opportunities
- Comprehensive benefits package (health, wellness, life insurance, fitness, English classes)
- Flexible vacation policy
- Community involvement opportunities through charitable alliances
- Wellness resources and support
- Inclusive environment that prioritizes diversity, equity, and accessibility
- High-growth company driven by high performance