About OpenAI Evals
OpenAI Evals is an open-source framework and registry for creating, running, and sharing evaluations of large language models (LLMs) and LLM systems. It provides templates, built-in benchmarks, and tooling to run, score, and compare model outputs, and it supports building custom evals for specific use cases.
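For orientation, the sketch below shows the JSONL sample format that the basic eval templates in the repo expect: each line pairs a chat-style "input" with an "ideal" reference answer. The file name, eval name, and model name are placeholders, and details may differ between versions of the framework.

```python
# Minimal sketch of the JSONL sample format used by the basic eval templates.
# Names and contents here are illustrative only.
import json

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
]

# Write the samples to a JSONL file that a registry entry can point to.
with open("my_eval_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Once registered in a registry YAML entry, an eval is typically run from
# the command line, e.g.:
#   oaieval gpt-3.5-turbo my-eval
# (model name and eval name are placeholders)
```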
Key Features
- Evaluation framework & registry — templates and community-contributed evals for many tasks
- Model-graded and custom eval support — build custom scoring logic and judge models (see the sketch after this list)
- Built-in benchmarks and templates — reproducible evals for standard tasks
- Local and programmatic runs with logging options (e.g., to databases) and integrations
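To make the custom-eval and programmatic-run items concrete, here is a minimal sketch of a custom eval class, assuming the `evals` Python package and the subclassing pattern documented in the repo; the class name, dataset argument, and method details are illustrative, and exact APIs may vary across versions.

```python
# A minimal sketch of a custom eval, assuming the `evals` Python package
# from the OpenAI Evals repo. Class, dataset, and method details are
# illustrative; exact APIs may differ between versions.
import random

import evals
import evals.metrics


class SingleWordMatch(evals.Eval):
    """Scores each sample by exact match against the "ideal" answer."""

    def __init__(self, samples_jsonl, **kwargs):
        super().__init__(**kwargs)
        self.samples_jsonl = samples_jsonl

    def eval_sample(self, sample, rng: random.Random):
        # Use the sample's chat-style "input" as the prompt.
        prompt = sample["input"]
        result = self.completion_fn(prompt=prompt, max_tokens=16)
        sampled = result.get_completions()[0]
        # Record whether the completion matches the reference answer;
        # this is where custom scoring logic (or a judge model) would go.
        evals.record_and_check_match(
            prompt=prompt,
            sampled=sampled.strip(),
            expected=sample["ideal"],
        )

    def run(self, recorder):
        samples = evals.get_jsonl(self.samples_jsonl)
        self.eval_all_samples(recorder, samples)
        # Aggregate per-sample "match" events into a single accuracy metric.
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
```

In the repo's convention, a class like this is wired up through a registry YAML entry and then invoked with the `oaieval` CLI or run programmatically; the recorder determines where results are logged (for example, to local JSONL files).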
Use Cases & Best For
Best suited for teams that need to benchmark LLMs on standard tasks, build custom evals for their own use cases, and compare model outputs as models or prompts change.