
Benchmark LLM: MMLU

By Michele Zonca

1 September 2025

1 minute to read

As AI models, especially large language models, become more capable and more widespread, there’s increasing demand for rigorous ways to measure what they can and cannot do. Benchmarks are essential tools in this evaluation.

As I’m digging into some details, I decided to jot down some thoughts on the topic, mainly for me to refer back to.

One of the first broad, general-purpose benchmarks for large language models is MMLU (Massive Multitask Language Understanding).

Introduced in 2021 (by researchers at UC Berkeley and other universities) as a “bar exam for AI”, it tests models with multiple-choice questions across 57 subjects, from law to medicine.
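
To make the format concrete, here is a minimal sketch of how an MMLU-style item can be turned into a prompt and scored by plain accuracy. This is my own illustration, not the official evaluation harness: the `format_prompt` and `evaluate` helpers, the sample item, and the `model_answer` stand-in are all invented for the example.

```python
from typing import Callable, Dict, List

LETTERS = ["A", "B", "C", "D"]

def format_prompt(item: Dict) -> str:
    """Render a single multiple-choice item in the usual MMLU layout."""
    lines = [item["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def evaluate(items: List[Dict], model_answer: Callable[[str], str]) -> float:
    """Fraction of items where the model's chosen letter matches the gold answer."""
    correct = 0
    for item in items:
        prediction = model_answer(format_prompt(item)).strip().upper()[:1]
        if prediction == LETTERS[item["answer"]]:
            correct += 1
    return correct / len(items)

# Tiny illustrative item (invented, not taken from the real MMLU test set).
sample_items = [
    {
        "question": "Which organ is primarily responsible for filtering blood?",
        "choices": ["Liver", "Kidney", "Lung", "Pancreas"],
        "answer": 1,
    }
]

# `model_answer` stands in for whatever LLM call you use;
# here it always answers "B" just to keep the sketch runnable.
print(evaluate(sample_items, lambda prompt: "B"))  # -> 1.0
```

The real benchmark does essentially this, only over roughly 14,000 test questions spread across its 57 subjects, and the headline score is the resulting accuracy.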

It quickly became the gold standard for evaluating large language models, with labs racing to report ever-higher scores. Today it remains a baseline for historical comparison, but newer, harder, and more dynamic benchmarks are taking its place.

When GPT-3 was first evaluated, it scored only 43.9%; GPT-4, Claude 2, and other frontier models later reached 80–90%, comparable to human experts.

As models rapidly improved, MMLU’s weaknesses became clear. Leading systems began to saturate the benchmark, which reduced its ability to differentiate between state-of-the-art models. Many of its questions had been scraped from publicly available sources, raising concerns about data contamination and inflated scores. Its static, dated test set also meant that once memorized, performance gains were less meaningful. Finally, its reliance on multiple-choice questions limited the benchmark’s scope, failing to capture real-world conversational reasoning, creativity, or robustness.