The Senior Cloud Platform Engineer will be responsible for ensuring the reliability, performance, and scalability of the company's AI Inferencing Service. The role includes participating in a shared on-call rotation to maintain 24/7 service reliability, developing and maintaining advanced monitoring and alerting systems, and implementing auto-scaling policies to handle variable inference loads cost-effectively.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field
- 5-8+ years of experience as a Site Reliability Engineer, DevOps engineer, or in a related role, supporting a large-scale, customer-facing service in a public cloud environment
- Strong programming/scripting skills in languages like Python, Go, or Java
- Proven experience with containerization and orchestration technologies (Docker, Kubernetes)
- Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog)
- Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
- Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD)
- Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems
Benefits
- Competitive total rewards package, including base salary, equity, and benefits
- 95% premium coverage for employee medical insurance
- 77% premium coverage for dependents
- Health Savings Account (HSA) with employer contribution
- Dental, Vision, Short- and Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans
- Flexible Spending Account (FSA) options, including Health Care, Limited Purpose, and Dependent Care
- Well-being benefits, including a full subscription to Headspace, Gympass+ membership, One Medical membership, counseling services through an Employee Assistance Program, and more