We are growing rapidly and expanding the use of AI across our platform and engineering operations. As our systems scale in complexity and business criticality, we are investing in next-generation observability, intelligent automation, and AIOps capabilities to enable proactive, insight-driven operations.

Requirements

8–10+ years of experience in SRE, platform engineering, or cloud infrastructure roles supporting large-scale production environments.
Demonstrated experience leading architecture, reliability strategy, or operational platforms across multiple teams or organizational domains.
Proven track record operating in 24/7 production environments, including incident leadership, postmortem practices, and proactive reliability management.

Cloud & Architecture
Deep expertise designing and operating large-scale AWS environments, including services such as VPC, EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, IAM, KMS, Route 53, and multi-account architectures.
Experience designing resilient, fault-tolerant systems using multi-AZ/multi-region patterns, graceful degradation, rate limiting, and capacity management.

Observability & Operational Intelligence
Senior-level experience with observability platforms (metrics, logs, traces, events) such as New Relic, Datadog, Prometheus/Grafana, OpenTelemetry, or similar.
Experience defining telemetry standards, instrumentation strategies, centralized dashboards, and low-noise alerting practices.
Experience improving operational signal quality through correlation, noise reduction, or advanced analytics.

AIOps / Intelligent Automation (Preferred)
Experience implementing or evaluating AIOps capabilities such as anomaly detection, event correlation, predictive alerting, automated remediation, or AI-assisted incident analysis.
Familiarity with applying machine learning or AI techniques to operational data, incident trends, or reliability workflows.

Automation & Infrastructure as Code
Expert-level experience with Infrastructure-as-Code using Terraform and/or CloudFormation, including reusable modules, GitOps workflows, and policy-as-code guardrails.
Strong scripting or programming skills (Python, Go, Bash, or similar) for automation and operational tooling.

Systems & Platform Expertise
Expert understanding of Linux systems, networking (TCP/IP, DNS, TLS), and distributed system behavior.
Expert-level experience with Kubernetes and cloud-native architecture patterns.

Leadership & Impact
Demonstrated ability to influence technical direction without direct authority.
Experience mentoring senior engineers and setting organization-wide engineering standards.
Ability to operate effectively in complex, high-impact environments and drive initiatives from concept through adoption.

Benefits

Medical, dental and vision insurance
Health Savings Account
Flexible Spending Accounts
Telehealth
401(k) and 401(k) match
Life and AD&D insurance
Short-Term and Long-Term Disability
Flexible Time Off (FTO) or Paid Time Off (PTO)
Employee Well-Being program
11 paid holidays plus 1 inclusive holiday per year
Volunteer Time Off
Employee Referral program
Education Reimbursement Program
Employee Recognition and Appreciation program

Sr. Staff SRE

About Lytx, Inc.