
Job description
We are looking for a Senior Systems HPC Engineer to play a key role in building our hyperscaler platform, working across its core components while analyzing and optimizing the performance of large-scale GPU clusters at the intersection of hardware and software.
Investigate and troubleshoot performance issues of GPU cluster under real workloads (training and inference), Evaluate and integrate new hardware, system configurations and tuning approaches through software stack, Support complex performance-related escalations from internal teams and customers.
We expect you to have 5+ years of professional experience in system-level software development (focused on performance optimization, low-level programming), 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning), In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
Company

Tech, Software & IT Services
Nebius provides a comprehensive AI cloud platform designed for developers and organizations building and deploying generative AI applications. We offer a full-stack infrastructure – encompassing secure, high-performance computing and cost-optimized resources – enabling efficient machine learning model training and deployment. Nebius caters to a diverse clientele, including startups, enterprises, and research institutions, empowering them to accelerate AI innovation and deliver impactful scientific breakthroughs. Our platform simplifies the complexities of AI infrastructure, allowing teams to focus on core development and maximize the value of their AI investments.
Keep exploring

Nebius

Nebius

Nebius

Nebius

Nebius
NVIDIA