LinkedIn is seeking an HPC Network Engineer to design, deploy, and operate high-performance, low-latency Ethernet fabrics for large-scale GPU clusters. The role focuses on RoCE v2–based GPU interconnect networks supporting AI/ML training, inference, and HPC workloads.
Requirements
- Network architecture and design for large-scale LLM training and inference workloads.
- Design RoCE v2–based GPU interconnection fabrics for multi-rack and multi-pod GPU clusters
- Define lossless Ethernet architectures (Clos / fat-tree / leaf-spine) optimized for RDMA
- Select and validate 400G / 800G Ethernet switching platforms and NICs (ConnectX, BlueField, etc.)
- Deep expertise in host-level and Kubernetes pod networking architectures, including enablement of high-performance features such as RDMA and GPU Direct.
- Experience in host network performance tuning for large-scale collective communications, balancing latency, throughput, and congestion control.
- Analyze system performance and diagnose complex cross-layer issues.
Benefits
- Generous health and wellness programs
- Time away for employees of all levels
- Annual performance bonus
- Stock
- Benefits and/or other applicable incentive compensation plans