We are looking for a Research Engineer to work on the systems layer behind large-scale RL training: optimizing kernels, improving memory and communication efficiency, scaling distributed workloads, and pushing the throughput and reliability of training systems closer to hardware limits.
Requirements
- Strong systems engineering experience in AI/ML infrastructure, especially around large-scale model training or inference.
- Deep familiarity with PyTorch and distributed training and inference frameworks such as PyTorch Distributed, DeepSpeed, FSDP, Megatron, vLLM, Ray, or related tooling.
- Experience optimizing training performance across kernels, memory movement, communication overhead, or parallelization strategies.
- Hands-on experience with large-scale training techniques including data parallelism, tensor parallelism, and pipeline parallelism.
- Strong understanding of GPU architecture, profiling, and performance debugging.
Benefits
- Cash compensation range of $150k-$300k
- Flexible work arrangements
- Visa sponsorship and relocation support for international candidates
- Quarterly team offsites, hackathons, conferences, and learning opportunities