The client is a non-profit organization building an autonomous AI Physicist designed to advance humanity's understanding of the fundamental laws of nature. The goal is for the AI Physicist, by 2035, to achieve a breakthrough that unifies quantum field theory and general relativity and to explain the deepest unresolved phenomena in our universe.

Requirements

Design and run large-scale pre-training experiments for both dense and MoE architectures
Tune optimizer configurations and learning rate schedules
Own model and training recipes end-to-end
Run ablations and scaling-law studies to set optimal tokens-to-train targets
Provide strategic insights to the executive team on financial implications of major decisions
Design capital allocation frameworks that maximize scientific impact while ensuring long-term sustainability
Build and harden high-throughput data pipelines encompassing dataset curation, filtering, deduplication, pack-by-length optimization, and contamination control
Design and implement multilingual and multimodal data ingest systems with intelligent repeat scheduling
Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert/context parallelism, and high-speed interconnects
Choose and configure optimal distributed strategies and launch parameters
Integrate performance primitives including FlashAttention-3, fused optimizers, and custom CUDA/Triton kernels
Debug complex distributed training issues using tools like Nsight, py-spy, TensorBoard, and W&B
Build comprehensive observability systems for long-horizon runs tracking throughput/efficiency, gradient statistics, loss spikes, token-mix drift, data freshness, and evaluation dashboards
Manage multi-node GPU jobs (SLURM/Kubernetes/Ray), debug NCCL hangs and clock skew issues, and implement elastic restart mechanisms
Shepherd multi-week training jobs through completion, recover gracefully from failures, and deliver stable checkpoints with measurable evaluation wins
Establish systems for managing multiple currencies, cross-border partnerships, international payments, and complex funding structures
Create financial frameworks that can adapt to new funding models, from traditional grants to innovative financing mechanisms
Define evaluation suites and red-team protocols to monitor scaling behavior and catch regression signals over long training runs
Partner with safety and alignment teams on SFT/RLAIF/DPO stages and evaluations
Collaborate across research, infrastructure, product, and safety teams to turn research wins into robust model artifacts and services
Lead cross-functional efforts and mentor engineers on distributed training best practices and stabilization techniques
Write crisp RFCs and retrospectives to document learnings and establish institutional knowledge

Benefits

Generous Paid Time Off
401(k) Matching
Retirement Plan
Visa Sponsorship
Four Day Work Week
Generous Parental Leave
Tuition Reimbursement
Relocation Assistance

Member of Technical Staff, Training Engineer (Large Scale Foundation Models)


About Plus10 Recruitment