As a Staff Software Engineer, AI Research Infrastructure at Databricks, you will develop and run the research stack that powers Databricks AI Research. You will design and build services to schedule, orchestrate, and observe large-scale training and inference experiment workloads across thousands of GPUs, and partner with research scientists, ML engineers, and platform teams to turn experimental workloads into robust, repeatable pipelines.

Requirements

BS/MS or PhD in Computer Science or a related field.
5+ years of software engineering experience, including substantial time working on large-scale distributed systems or infrastructure.
Deep experience with building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers.
Proficiency in one or more systems programming languages (e.g., C++, Rust, Go, Java, Scala) and ability to design, implement, and debug complex services.
Experience with building or significantly contributing to cluster schedulers, resource managers, or large-scale job orchestration systems (e.g., Kubernetes, Slurm, Ray, custom internal systems).
Understanding of modern ML training and inference workflows (e.g., distributed training, model parallelism, fine-tuning, evaluation), even if not primarily a research scientist.
Ability to move fast and be pragmatic in getting things done, while caring about operational excellence.
Strong communication skills: you communicate clearly with both researchers and engineers and enjoy translating between research needs and infrastructure realities.

Benefits

Annual performance bonus
Equity
Comprehensive benefits and perks

Staff Software Engineer - AI Research Infrastructure

Job Details

About Databricks