The client is a non-profit organization building an autonomous AI Physicist designed to advance humanity's understanding of the fundamental laws of nature. The goal is for the AI Physicist to achieve a breakthrough that unifies quantum field theory and general relativity, and to explain the deepest unresolved phenomena in our universe, by 2035.
Requirements
- Design and run large-scale pre-training experiments for both dense and MoE architectures
- Tune optimizer configurations and learning rate schedules
- Own model and training recipes end-to-end
- Run ablations and scaling-law studies to set optimal tokens-to-train targets
- Provide strategic insights to the executive team on financial implications of major decisions
- Design capital allocation frameworks that maximize scientific impact while ensuring long-term sustainability
- Build and harden high-throughput data pipelines encompassing dataset curation, filtering, deduplication, pack-by-length optimization, and contamination control
- Design and implement multilingual and multimodal data ingest systems with intelligent repeat scheduling
- Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert/context parallelism, and high-speed interconnects
- Choose and configure optimal distributed strategies and launch parameters
- Integrate performance primitives including FlashAttention-3, fused optimizers, and custom CUDA/Triton kernels
- Debug complex distributed training issues using tools like Nsight, py-spy, TensorBoard, and W&B
- Build comprehensive observability systems for long-horizon runs tracking throughput/efficiency, gradient statistics, loss spikes, token-mix drift, data freshness, and evaluation dashboards
- Manage multi-node GPU jobs (SLURM/Kubernetes/Ray), debug NCCL hangs and clock skew issues, and implement elastic restart mechanisms
- Shepherd multi-week training jobs through completion, recover gracefully from failures, and deliver stable checkpoints with measurable evaluation wins
- Establish systems for managing multiple currencies, cross-border partnerships, international payments, and complex funding structures
- Create financial frameworks that can adapt to new funding models, from traditional grants to innovative financing mechanisms
- Define evaluation suites and red-team protocols to monitor scaling behavior and catch regression signals over long training runs
- Partner with safety and alignment teams on SFT/RLAIF/DPO stages and evaluations
- Collaborate across research, infrastructure, product, and safety teams to turn research wins into robust model artifacts and services
- Lead cross-functional efforts and mentor engineers on distributed training best practices and stabilization techniques
- Write crisp RFCs and retrospectives to document learnings and establish institutional knowledge
Benefits
- Generous Paid Time Off
- 401(k) Matching
- Retirement Plan
- Visa Sponsorship
- Four Day Work Week
- Generous Parental Leave
- Tuition Reimbursement
- Relocation Assistance