As an Infrastructure Engineer - Site Reliability, you'll be responsible for designing and maintaining the systems that keep Zyphra's infrastructure robust, observable, secure, and scalable.
Requirements
- Experience in high-performance compute environments, such as ML clusters or GPU farms
- Background in infrastructure as code (e.g., Ansible, Terraform)
- Familiarity with software release engineering for ML/AI systems is a plus
- Experience designing reliable environments for experimental workloads and reproducible runs
- Knowledge of compliance and audit standards in deployment and system security
- Experience with load testing, fault injection, and chaos engineering to harden systems under stress
- Passion for building tooling that makes infrastructure invisible and reliable for end users
Benefits
- Comprehensive medical, dental, vision, and FSA plans
- Competitive compensation and 401(k)
- Relocation and immigration support on a case-by-case basis
- On-site meals prepared by a dedicated culinary team; Thursday Happy Hours