We are building the world's largest AI clusters and are looking for a highly skilled distributed systems engineer to scale and optimize AI infrastructure components such as the GPU control plane and GPU data plane. You will provide technical leadership to the team, bring clarity to ambiguous problems, and come up with innovative solutions.
Requirements
- Design and develop solutions to scale and optimize AI compute infrastructure components such as the GPU control plane and GPU data plane, with the goal of improving customer experience and customer workload performance on our AI infrastructure.
- Develop "best-in-class" AI compute infrastructure for our customers by ensuring that the services and the components are well-defined and modularized, secure, reliable, diagnosable, actively monitored, compliant and reusable.
- Collaborate with cross-functional teams, including development, operations, and product management, to understand their requirements and design innovative orchestration solutions.
- Mentor junior developers and drive modern software engineering practices such as using data and telemetry to make decisions, defining clear interfaces across components, design reviews, coding standards, code reviews, and comprehensive coverage through unit tests, integration tests, and active production monitoring.
- Develop benchmark metrics and automation to drive and track performance and reliability across customer workloads and the lower infrastructure stack.
Benefits
- Flexible medical options
- Life insurance
- Retirement options
- Volunteer programs
- Equal Employment Opportunity