The process of measuring a model's performance on specific tasks, often using benchmarks or custom test sets.
Friendly Description: Evaluation, often shortened to "eval," is how we check whether an AI model is actually any good. It's like a final exam: we give the model a set of carefully chosen questions or tasks, score its answers, and use the results to compare it against other models or earlier versions. Good evals are the only way to know if an AI is improving or just changing.
Example: Before releasing a new AI assistant, a team might run thousands of test questions about coding, math, writing, and reasoning. The eval results show where the model shines and where it stumbles, helping the team decide whether it's ready to ship or needs more work.
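The scoring loop described above can be sketched in a few lines: run each test question through the model, compare the answer to the expected one, and report accuracy. The `toy_model` function, the tiny test set, and the exact-match scoring rule here are all hypothetical placeholders, since real evals use much larger benchmarks and more nuanced grading.

```python
def toy_model(question: str) -> str:
    """Stand-in for a real model call; answers a few fixed questions."""
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
    }
    return canned.get(question, "I don't know")

def run_eval(test_set: list[tuple[str, str]]) -> float:
    """Return the fraction of questions the model answers exactly correctly."""
    correct = sum(1 for q, expected in test_set if toy_model(q) == expected)
    return correct / len(test_set)

# A miniature test set of (question, expected answer) pairs.
test_set = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),  # the toy model will miss this one
]

print(f"accuracy: {run_eval(test_set):.2f}")
```

Running the same eval on two model versions lets a team compare scores directly, which is what makes the "improving or just changing" question answerable.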