Cohere is seeking a Senior ML Systems Engineer to build, maintain, and evolve the training framework for frontier-scale language models. The role involves designing and maintaining core components for fast, reliable, and scalable model training, as well as building tooling that connects research ideas to thousands of GPUs.

Requirements

Strong engineering experience in large-scale distributed training or HPC systems
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
Experience working with containerized environments (Docker, Singularity/Apptainer)
A track record of building tools that increase developer velocity for ML teams
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
Strong collaboration skills — you’ll work closely with infra, research, and deployment teams

Benefits

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, with offices in Toronto, New York, San Francisco, London and Paris, plus a co-working stipend
6 weeks of vacation (30 working days!)

Senior ML Systems Engineer, Frameworks & Tooling
