As a Staff Software Engineer, AI Research Infrastructure at Databricks, you will develop and run the research stack that powers Databricks AI Research. You will design and build services to schedule, orchestrate, and observe large-scale training and inference experiment workloads across thousands of GPUs, and partner with research scientists, ML engineers, and platform teams to turn experimental workloads into robust, repeatable pipelines.
Requirements
- BS/MS or PhD in Computer Science or a related field.
- 5+ years of software engineering experience, including substantial time working on large-scale distributed systems or infrastructure.
- Deep experience with building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers.
- Proficiency in one or more systems programming languages (e.g., C++, Rust, Go, Java, Scala) and ability to design, implement, and debug complex services.
- Experience with building or significantly contributing to cluster schedulers, resource managers, or large-scale job orchestration systems (e.g., Kubernetes, Slurm, Ray, custom internal systems).
- Understanding of modern ML training and inference workflows (e.g., distributed training, model parallelism, fine-tuning, evaluation), even if you are not primarily a research scientist.
- Ability to move fast and be pragmatic in getting things done, while caring about operational excellence.
- Strong communication skills: you communicate clearly with both researchers and engineers, and you enjoy translating between research needs and infrastructure realities.
Benefits
- Annual performance bonus
- Equity
- Comprehensive benefits and perks