We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security.
Requirements
- Expert in site reliability engineering with proven experience defining, monitoring, and maintaining SLOs and SLAs for production systems
- Strong background in capacity planning and management, including forecasting, resource allocation, and cost optimization for distributed systems
- Experienced in incident response, on-call rotations, and post-mortem processes, with a track record of reducing MTTR and improving system resilience
- Deep knowledge of deployment systems, including progressive rollouts, canary deployments, feature flags, and automated rollback mechanisms
- Proficient in observability tools and practices, including metrics, logging, tracing, and alerting systems (Prometheus, Grafana, ELK stack, or similar)
- Strong understanding of infrastructure security, including tenant isolation, workload isolation, network segmentation, and security hardening
- Experience with secrets management, key management systems (KMS), certificate management, and secure credential rotation
- Knowledge of compliance frameworks (SOC 2, ISO 27001, or similar) and security best practices for cloud platforms
- Excellent problem-solving skills, with the ability to debug complex distributed-systems issues under pressure
- Strong automation mindset with experience using infrastructure-as-code, configuration management, and CI/CD pipelines