
Job description
Join Nebius' team as a Senior HPC Cluster Engineer to play a key role in developing our cutting-edge hyperscaler platform. The role involves analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.
Tune the performance of GPU clusters and InfiniBand networks, analyze and troubleshoot issues related to GPUs and InfiniBand networks, integrate new hardware into the existing infrastructure, and enhance automation systems for proactive monitoring and issue resolution.
We're looking for a Senior HPC Cluster Engineer with 5+ years of professional experience in system-level software development, 3+ years of hands-on experience with Linux systems, and in-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
Company

Tech, Software & IT Services
Nebius provides a comprehensive AI cloud platform designed for developers and organizations building and deploying generative AI applications. We offer a full-stack infrastructure – encompassing secure, high-performance computing and cost-optimized resources – enabling efficient machine learning model training and deployment. Nebius caters to a diverse clientele, including startups, enterprises, and research institutions, empowering them to accelerate AI innovation and deliver impactful scientific breakthroughs. Our platform simplifies the complexities of AI infrastructure, allowing teams to focus on core development and maximize the value of their AI investments.
Keep exploring

Nebius

Nebius

Nebius
NVIDIA
NVIDIA

Nebius