
Job description
Design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems.
Design and evolve multi-provider, multi-region GPU compute clusters, serve as primary technical point of contact for customers, define SLOs and error budgets, ensure health and performance of high-speed interconnects, build deep visibility into GPU utilization, and lead incident response for complex failures.
You will have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute.
Keep exploring
Sign in to see similar jobs
Create a free account to discover roles related to this posting.
Company

Tech, Software & IT Services
Andromeda is a rapidly growing company focused on democratizing access to large-scale AI infrastructure. Founded by Nat Friedman and Daniel Gross, Andromeda provides a platform that efficiently routes AI training and inference jobs across a global compute supply, serving leading AI labs, data centers, and cloud providers. Andromeda’s core value proposition lies in delivering flexibility and cost-efficiency in a dynamic AI market. The company is building a 'liquidity layer' for global AI compute, and is actively expanding its team of talented engineers and researchers to support this ambitious vision and continued growth.