We are growing rapidly and expanding the use of AI across our platform and engineering operations. As our systems scale in complexity and business criticality, we are investing in next-generation observability, intelligent automation, and AIOps capabilities to enable proactive, insight-driven operations.
Requirements
- 8–10+ years of experience in SRE, platform engineering, or cloud infrastructure roles supporting large-scale production environments.
- Demonstrated experience leading architecture, reliability strategy, or operational platforms across multiple teams or organizational domains.
- Proven track record operating in 24/7 production environments, including incident leadership, postmortem practices, and proactive reliability management.
Cloud & Architecture
- Deep expertise designing and operating large-scale AWS environments, including services such as VPC, EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, IAM, KMS, Route 53, and multi-account architectures.
- Experience designing resilient, fault-tolerant systems using multi-AZ/multi-region patterns, graceful degradation, rate limiting, and capacity management.
Observability & Operational Intelligence
- Senior-level experience with observability platforms (metrics, logs, traces, events) such as New Relic, Datadog, Prometheus/Grafana, OpenTelemetry, or similar.
- Experience defining telemetry standards, instrumentation strategies, centralized dashboards, and low-noise alerting practices.
- Experience improving operational signal quality through correlation, noise reduction, or advanced analytics.
AIOps / Intelligent Automation (Preferred)
- Experience implementing or evaluating AIOps capabilities such as anomaly detection, event correlation, predictive alerting, automated remediation, or AI-assisted incident analysis.
- Familiarity with applying machine learning or AI techniques to operational data, incident trends, or reliability workflows.
Automation & Infrastructure as Code
- Expert-level experience with Infrastructure-as-Code using Terraform and/or CloudFormation, including reusable modules, GitOps workflows, and policy-as-code guardrails.
- Strong scripting or programming skills (Python, Go, Bash, or similar) for automation and operational tooling.
Systems & Platform Expertise
- Expert understanding of Linux systems, networking (TCP/IP, DNS, TLS), and distributed system behavior.
- Expert-level experience with Kubernetes and cloud-native architecture patterns.
Leadership & Impact
- Demonstrated ability to influence technical direction without direct authority.
- Experience mentoring senior engineers and setting organization-wide engineering standards.
- Ability to operate effectively in complex, high-impact environments and drive initiatives from concept through adoption.
Benefits
- Medical, dental, and vision insurance
- Health Savings Account
- Flexible Spending Accounts
- Telehealth
- 401(k) and 401(k) match
- Life and AD&D insurance
- Short-Term and Long-Term Disability
- Flexible Time Off (FTO) or Paid Time Off (PTO)
- Employee Well-Being program
- 11 paid holidays plus 1 inclusive holiday per year
- Volunteer Time Off
- Employee Referral program
- Education Reimbursement program
- Employee Recognition and Appreciation program