Join an intellectually stimulating work environment and be a pioneer in AI benchmarking and dataset engineering. Collaborate with our R&D team to build evaluation infrastructure for post-transformer models.
Responsibilities
- Proactively identify, prioritize, and curate relevant public and client-driven benchmarks
- Evaluate candidate benchmarks for clarity, data quality, evaluation methodology, and fit with our model roadmap
- Run benchmarks with baseline models to validate setup, uncover edge cases, and de-risk R&D runs
- Hand off "benchmark-ready" packages to R&D (specs, data, evaluation scripts, expected metrics, constraints)
- Maintain a shared vocabulary and documentation around benchmarks, datasets, and evaluation formats
- Track and organize benchmark results, model leaderboards, and "what good looks like" for different customers and scenarios
- Contribute to demos and public-facing proof points based on benchmark outcomes
Benefits
- Full-time, permanent contract
- Remote work
- Option to work from or meet with other team members at one of our offices: Palo Alto, CA; Paris, France; or Wroclaw, Poland
- Flexible compensation based on your profile and location