We’re looking for seasoned ML Infrastructure engineers to design, build, and maintain training and serving infrastructure for ML research. The ideal candidate has experience in developing tools to diagnose ML infrastructure problems and failures, working with cloud platforms, and maximizing GPU allocation and utilization.

Requirements

4+ years of experience supporting the infrastructure within an ML environment
Experience in developing tools used to diagnose ML infrastructure problems and failures
Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)
Experience working with GPUs
Experience with large GPU clusters and high-performance computing/networking
Experience supporting large language model training
Experience with ML frameworks such as PyTorch, TensorFlow, or JAX
Experience with GPU kernel development

Software Engineer, Machine Learning Infrastructure

Job Details

About Character.AI