TechInsights is building the reliability and AI operations foundation for its next chapter — an AI-first intelligence platform that runs the most demanding semiconductor intelligence workflows in the world. We're looking for a Senior Site Reliability Engineer who wants to own that foundation.

Requirements

Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation
Architect for blast radius containment — agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
Mature our Canada Central/West active-active architecture toward 24-hour RTO with full regional failover
Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing
Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards
Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation
Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions) — set standards, optimize deployment frequency, and ensure teams can ship confidently
Drive internal developer platform (IDP) adoption and enable teams on SRE practices: on-call readiness, SLO definition, runbook development, and self-service tooling
Represent reliability in architectural discussions; surface risk before it's committed to design
Own the service catalog — a living inventory of all services, AI agents, dependencies, ownership, and SLOs
Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput
Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement
Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations
Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance
Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale
Formally mentor junior and intermediate SREs, with accountability for their technical growth and career progression
Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity

Benefits

Company-sponsored training and development opportunities
Comprehensive benefits package (health, wellness, life insurance, fitness, English classes)
Flexible vacation policy
Community involvement opportunities through charitable alliances
Wellness resources and support
Inclusive environment that prioritizes diversity, equity, and accessibility
High-growth company driven by high performance

Senior Site Reliability Engineer (Remote Poland)
