Role Overview

We are seeking a highly skilled Systems Engineer to join our team to support benchmarking of GPU platforms for machine learning and AI workloads. You will play a critical role in evaluating the performance of GPU-based hardware for various deep learning and AI frameworks, enabling data-driven decisions for platform optimization and next-generation hardware development.

What You Will Do

Work closely with hardware and development teams to profile and analyze GPU performance at the system and kernel level. Evaluate and compare GPU performance across different platforms, architectures, and software stacks. Perform acceptance testing for new GPU clusters, ensuring that hardware and software meet performance, stability, and compatibility requirements for AI workloads.

Why It Might Be a Fit

We expect you to be proficient in Unix/Linux and in Python and Bash for automation, to have a good understanding of the GPU stack (CUDA, NCCL, drivers, and relevant libraries), and to have a proven ability to troubleshoot complex system issues, including hardware, software, and networking problems.

Requirements

Proficient in Unix/Linux, plus Python and Bash for automation
Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries
Proven ability to troubleshoot complex system issues including hardware, software, and networking problems
Familiarity with containerized environments (e.g., Docker, Kubernetes)
Experience with modern deep learning frameworks (PyTorch, JAX, vLLM, TensorRT-LLM)
Experience with job schedulers and resource managers (Slurm, Volcano, etc.)

Benefits

Competitive compensation
Career growth and learning opportunities
Flexibility and work-life balance
Collaborative and innovative culture
Opportunity to work on impactful AI projects
International environment and talented teams

HPC System Engineer
