The Observability & Operations Engineer will design and implement a comprehensive observability strategy across all AWS environments, leveraging AI-powered tools to detect anomalies and surface insights automatically. They will build and manage monitoring platforms, use AI coding assistants to accelerate development, and own the incident management lifecycle.
Requirements
- 7-10 years of experience in Software Engineering, Cloud Operations, or Site Reliability Engineering
- 5+ years of hands-on experience with AWS infrastructure and AWS PaaS services; certifications are a plus
- Demonstrated experience building repeatable, code-first pipelines and treating operational configuration as first-class software
- Experience working with polyglot environments including Java, Kotlin, and Node.js
- Demonstrated experience using AI tools in a professional setting
- Deep experience with enterprise observability platforms
- Proficiency with distributed tracing frameworks and log management platforms
- Strong understanding of SRE principles including SLOs, SLAs, error budgets, and chaos engineering
- Hands-on FinOps experience
- Experience instrumenting polyglot applications and cloud-native microservices for observability
- Experience with CI/CD tooling, specifically Harness
- Solid understanding of Infrastructure as Code using Terraform
- Fluency with AI tools in day-to-day work
- Ability to lead incident response, facilitate blameless post-mortems, and drive long-term reliability improvements
- Strong collaboration skills for working across platform and product engineering teams
- Knowledge of containerization technologies and microservices architecture
Benefits
- Generous Paid Time Off
- 401(k) Matching
- Retirement Plan
- Health Insurance