As an AI/HPC System Performance Engineer, you will drive end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. You will work at the intersection of network fabric design, distributed computing, and AI workload behavior to ensure Meta's HPC systems deliver maximum throughput and efficiency for frontier model development.
Responsibilities
- Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
- Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
- Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
- Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
- Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
- Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
- Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
- Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
- Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
- Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand monitoring and analysis coverage across the HPC stack
Benefits
- Paid time off
- 401(k) matching
- Health insurance
- Dental insurance
- Vision insurance