Cohere is seeking a Senior ML Systems Engineer to build, maintain, and evolve the training framework for frontier-scale language models. The role involves designing and maintaining core components for fast, reliable, and scalable model training, as well as building tooling that connects research ideas to thousands of GPUs.
Requirements
- Strong engineering experience in large-scale distributed training or HPC systems
- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
- Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
- Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
- Experience working with containerized environments (Docker, Singularity/Apptainer)
- A track record of building tools that increase developer velocity for ML teams
- Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
- Strong collaboration skills — you’ll work closely with infra, research, and deployment teams
Benefits
- An open and inclusive culture and work environment
- Work closely with a team on the cutting edge of AI research
- Weekly lunch stipend, in-office lunches & snacks
- Full health and dental benefits, including a separate budget to take care of your mental health
- 100% parental leave top-up for up to 6 months
- Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
- Remote-flexible, with offices in Toronto, New York, San Francisco, London, and Paris, plus a co-working stipend
- 6 weeks of vacation (30 working days!)