Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure. As a Senior AI Infrastructure Engineer, you will play a key role in building the next-generation AI cloud platform: a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware and gives ML practitioners self-serve AI cloud services.
Requirements
- 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
- 5+ years of experience writing high-performance, well-tested, production-quality code
- Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
- Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
- Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to any of these or to Kubernetes itself
- Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
- Deep experience with data center networking technologies and solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
- Experience with Cluster API or similar a big plus
- Experience working on high-performance compute, networking, and/or storage a big plus
- Experience virtualizing GPUs and/or InfiniBand a big plus
- Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
- Experience with infrastructure automation tools (Terraform, Ansible), monitoring/observability stacks (Prometheus, Grafana), and CI/CD pipelines (GitHub Actions, ArgoCD)
- Experience building IaaS or PaaS systems at scale a plus
- Experience with DPUs/SmartNICs a plus
- Knowledge of GPU programming, NCCL, or CUDA a plus
Benefits
- Generous Paid Time Off
- 401(k) Matching
- Retirement Plan
- Equal Opportunity Employer