We are looking for a highly skilled distributed systems engineer to scale and optimize AI infrastructure components, such as the GPU control plane and GPU data plane, that provide computing resources for customer AI workloads.
Requirements
- Design and develop solutions to scale and optimize AI compute infrastructure components, such as the GPU control plane and GPU data plane, with the goal of improving customer experience and customer workload performance on our AI infrastructure.
- Develop best-in-class AI compute infrastructure for our customers by ensuring that services and components are well-defined, modular, secure, reliable, diagnosable, actively monitored, compliant, and reusable.
- Collaborate with cross-functional teams, including development, operations, and product management, to understand their requirements and design innovative orchestration solutions.
- Mentor junior developers and drive modern software engineering practices, such as using data and telemetry to make decisions, well-defined interfaces across components, design reviews, coding standards, code reviews, and comprehensive coverage through unit tests, integration tests, and active production monitoring.
- Develop benchmark metrics and automation to drive and track performance and reliability across customer workloads and the lower infrastructure stack.
Benefits
- Flexible medical options
- Life insurance
- Retirement options
- Volunteer programs