We are seeking a staff ML engineer to design and evolve the large-scale offline platform for Unity Vector. The role focuses on building reliable infrastructure for generating training datasets, orchestrating ML workflows, and enabling efficient, distributed model training at scale.
Requirements
- Design and operate large-scale data pipelines that generate datasets for machine learning training and experimentation
- Develop infrastructure that supports distributed training workflows using technologies such as PyTorch, Ray Data, and Ray Train
- Integrate ML pipelines with workflow orchestration systems (e.g., Flyte, Airflow, or similar) to enable reliable multi-stage training workflows
- Improve reproducibility and observability of ML pipelines through dataset validation, monitoring, and automated testing
- Optimize performance and resource utilization across distributed compute systems used for data processing and model training
- Partner closely with ML engineers to enable efficient large-scale experimentation and model iteration
- Lead architectural improvements to ensure our offline ML pipelines remain scalable, reliable, and cost-efficient
Benefits
- Comprehensive health, life, and disability insurance
- Commute subsidy
- Employee stock ownership
- Competitive retirement/pension plans
- Generous vacation and personal days
- Support for new parents through leave and family-care programs
- Office food and snacks
- Mental health and wellbeing programs and support
- Employee Resource Groups
- Global Employee Assistance Program
- Training and development programs
- Volunteering and donation matching program