Benchmark

Level 3

Short Description

A standardized test used to compare AI model performance, such as MMLU, HumanEval, or GSM8K.

Friendly Description: A benchmark is a standardized test for AI models, kind of like the SAT for students. Researchers create a fixed set of questions or tasks, then run different models through them and see which scores best. Benchmarks let us compare apples to apples and notice when models really are getting smarter (and not just sounding smarter).

Example: MMLU is a popular benchmark with thousands of multiple-choice questions across subjects like history, biology, and law. When a new AI model is released, the team often shares its MMLU score so people can see how it stacks up against earlier models, the same way a runner's mile time tells you how fast they really are.
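The core idea — a fixed question set, each model run through it, a score tallied at the end — can be sketched in a few lines. This is a toy illustration: the questions and the `dummy_model` function below are made up, not real MMLU data or a real model.

```python
# A tiny, hypothetical multiple-choice benchmark (illustrative only).
questions = [
    {"prompt": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": "4"},
    {"prompt": "Capital of France?", "choices": ["Paris", "Rome", "Madrid"], "answer": "Paris"},
]

def dummy_model(prompt, choices):
    # Stand-in for a real model: it just picks the first choice.
    return choices[0]

def score(model, questions):
    # Accuracy: the fraction of questions the model answers correctly.
    correct = sum(
        1 for q in questions
        if model(q["prompt"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)

print(score(dummy_model, questions))  # 0.5: right on one of the two questions
```

Because the question set is fixed, running a different model through the same `score` function gives a directly comparable number, which is exactly what makes benchmark scores useful for apples-to-apples comparisons.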