Fathom is hiring an AI Engineer - Model Performance to optimize model inference speed, cost, and reliability, and to build fine-tuning infrastructure for the AI team. The role involves optimizing real systems that serve millions of meetings: weighing quantization trade-offs, debugging speculative decoding, and figuring out why one GPU family's tail latency explodes at high concurrency while another's stays stable.

Requirements

Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — not just deploying them, but tuning them: attention backends, scheduling strategies, CUDA graph warmup, prefix caching
Hands-on quantization experience — you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves
Production fine-tuning experience — LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures
Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines — not notebooks
Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead

Strong signal

Cost modeling for GPU infrastructure: you've had to choose between GPU types and justify the trade-off
Experience with multimodal models (audio/vision encoders + LLM decoders)
Experience with Modal, Ray Serve, or similar serverless GPU platforms
Understanding of audio processing (codecs, chunking, sample rates)
Experience building internal tooling that other engineers use — this role succeeds when the rest of the team ships faster

Benefits

Competitive compensation and benefits
A dynamic and collaborative engineering team
A supportive environment that encourages innovation and personal growth
Opportunity for impact
Startup experience
