Role Overview

As an HPC Operations Engineer at Lambda, you will be responsible for remotely deploying and configuring large-scale HPC clusters for AI workloads, troubleshooting and resolving HPC cluster issues, and providing clear and detailed requirements back to other engineering teams.

What You Will Do

Remotely deploy and configure large-scale HPC clusters for AI workloads
Troubleshoot and resolve HPC cluster issues
Provide clear and detailed requirements back to other engineering teams

Why It Might Be a Fit

You will work with a team of experienced engineers and have the opportunity to mentor and assist less experienced team members. You will also have the chance to stay up to date on the latest HPC/AI technologies and best practices.

Requirements

5+ years of experience in deploying and configuring HPC clusters for AI workloads
Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
Expertise in configuring and troubleshooting SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
Experience with Linux-based compute nodes, firmware updates, and driver installation
Experience with SLURM, Kubernetes, or other job scheduling systems
Excellent problem-solving and troubleshooting skills
Ability to work independently and as part of a team
Comfortable mentoring and supporting junior HPC engineers on cluster deployments

Benefits

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan
