We are looking for an engineer with experience in low-level systems programming and optimisation to join our growing ML team, optimising the performance of our models – both training and inference.
Requirements
- Understanding of modern ML techniques and toolsets
- Experience and systems knowledge required to debug a training run's performance end to end
- Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores and the memory hierarchy
- Debugging and optimisation experience using tools like CUDA-GDB, Nsight Systems and Nsight Compute
- Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN and cuBLAS
- Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization and asynchronous memory loads
- Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimisation and NVLink, and how to use these networking technologies to link up GPU clusters
- Understanding of the collective communication algorithms underpinning distributed GPU training in NCCL or MPI
- Inventive approach and the willingness to ask hard questions about whether we're taking the right approaches and using the right tools
- Fluency in English