As the leading data and evaluation partner for frontier AI companies, Scale is dedicated to advancing the evaluation and benchmarking of large language models (LLMs). We are building industry-leading LLM evals, setting new standards for model performance assessment.
Responsibilities
- Drive research on the effectiveness and limitations of existing LLM evaluation techniques.
- Design and develop novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness.
- Communicate, collaborate, and build relationships with clients and peer teams to facilitate cross-functional projects.
- Collaborate with internal teams and external partners to refine metrics and create standardized evaluation protocols.
- Implement scalable and reproducible evaluation pipelines using modern ML frameworks.
- Publish research findings in top-tier AI conferences and contribute to open-source benchmarking initiatives.
- Mentor and guide research scientists and engineers, providing technical leadership across cross-functional projects.
- Stay deeply engaged with the ML research community, tracking emerging work and contributing to the advancement of LLM evaluation science.
- Thrive in a high-energy, fast-paced startup environment, dedicating the time and effort needed to drive impactful results.
Benefits
- Comprehensive health, dental, and vision coverage
- Retirement benefits
- A learning and development stipend
- Generous PTO
- Commuter stipend