We are looking for an engineer with experience in low-level systems programming and optimisation to join our growing ML team, optimising the performance of our models – both training and inference.
Requirements
- Understanding of modern ML techniques and toolsets
- Experience and systems knowledge required to debug a training run's performance end to end
- Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores and the memory hierarchy
- Debugging and optimisation experience using tools like CUDA-GDB, Nsight Systems and Nsight Compute
- Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN and cuBLAS
- Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization and asynchronous memory loads
- Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimisation and NVLink, and how to use these networking technologies to link up GPU clusters
- Understanding of the collective communication algorithms underpinning distributed GPU training in NCCL or MPI
- Inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools
- Fluency in English