Anduril Industries is a defense technology company seeking a Senior Site Reliability Engineer to join its Production Engineering team. The successful candidate will design and implement comprehensive monitoring and observability systems, drive incident response, and improve system reliability and performance.
Requirements
- 7+ years of engineering experience, including at least 3 years focused on SRE, production operations, or infrastructure engineering
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- Deep expertise with Kubernetes in production environments, including operational challenges at scale (100+ nodes)
- Strong programming skills in one or more languages such as Go, Python, Rust, or Java, with the ability to build production-grade tooling
- Proven experience designing and implementing observability stacks (metrics, logging, tracing) using tools like Prometheus, Grafana, ELK/EFK, or equivalent
- Hands-on experience with cloud platforms (AWS, Azure, or GCP) and infrastructure-as-code practices
- Demonstrated ability to debug complex distributed systems issues across multiple layers of the stack
- Track record of improving system reliability through architectural changes, not just operational band-aids
- Strong incident management and communication skills, with experience leading responses to critical outages
Benefits
- Healthcare Benefits
- Income Protection
- Generous Time Off
- Family Planning & Parenting Support
- Mental Health Resources
- Professional Development
- Commuter Benefits
- Relocation Assistance
- Retirement Savings Plan