
Job description
We are looking for a Senior Hardware Support Engineer to own production hardware reliability across large-scale, mission-critical data center environments. This role operates at the intersection of hardware engineering, operations, and vendors, ensuring fleet stability, rapid root cause identification, and continuous improvement of server and platform reliability.
In this role, you will lead root cause analysis for complex hardware and firmware failures across production fleets, aggregate recurring problems and error patterns to identify systemic reliability issues, and act as the senior escalation point for hardware-related incidents that affect availability or performance.
You bring strong hands-on expertise with server hardware in data center or large-scale production environments, proven experience performing root cause analysis of hardware and firmware failures, and a deep understanding of server components (CPU, memory, storage, networking, power, BMC) and their failure modes.
Company

Tech, Software & IT Services
Nebius provides a comprehensive AI cloud platform designed for developers and organizations building and deploying generative AI applications. We offer a full-stack infrastructure – encompassing secure, high-performance computing and cost-optimized resources – enabling efficient machine learning model training and deployment. Nebius caters to a diverse clientele, including startups, enterprises, and research institutions, empowering them to accelerate AI innovation and deliver impactful scientific breakthroughs. Our platform simplifies the complexities of AI infrastructure, allowing teams to focus on core development and maximize the value of their AI investments.