Join the leader in entertainment innovation and help design the future. As a Machine Learning Operations Engineer, you'll partner with research and development to establish machine-learning best practices and tools.
Responsibilities
- Troubleshoot high-performance computing, storage, and networks for machine-learning workloads
- Collaborate with research, development, and engineering to establish machine-learning and data management workflows
- Improve capabilities for exploring, transforming, and managing large to very large datasets
- Collaborate with research and development to proactively iterate on and fine-tune model training for best performance and efficient use of machine-learning resources
- Collaborate with infrastructure teams to improve on-premises and cloud infrastructure
- Improve use of cloud compute and storage for global research teams, and manage that usage within budget
Requirements
- Comprehensive knowledge of AWS and infrastructure-as-code techniques
- Advanced proficiency with Python, Terraform, CloudFormation, Ansible, Git, and related tools
- Experience with machine learning and with scaling workloads across both cloud and on-premises GPU server environments
- Experience managing and coordinating storage of large machine-learning datasets
- Proficiency in Kubernetes cluster design, deployment, and management
- Interest in and understanding of industry trends in machine-learning development techniques, tools, and processes
- Comprehensive knowledge of continuous integration and continuous release processes and tools