October 11, 2025
In the last couple of posts we talked about MMLU and LiveBench. That naturally leads to a bigger question: what happens when benchmarks stop being useful?
The answer is saturation.
A benchmark is saturated when top models all hit very high scores (at or above human level) and cluster within a few points of each other near the ceiling. At that point, the benchmark can no longer meaningfully tell models apart. That's what happened to GLUE and SuperGLUE, and it's now partly true of MMLU.
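To make that concrete, here's a minimal sketch of one way you might flag saturation: check whether the top models' scores are both near the ceiling and tightly clustered. The thresholds and all of the scores below are hypothetical, not real leaderboard numbers.

```python
# Minimal sketch of a saturation check. Thresholds and scores are
# hypothetical illustrations, not real leaderboard data.

def is_saturated(scores, ceiling=100.0, near_ceiling=0.90, max_spread=3.0):
    """Flag a benchmark as saturated if the top models score close to
    the ceiling AND their scores are tightly clustered."""
    top = sorted(scores, reverse=True)[:5]           # look at the top 5 models
    near_top = all(s >= near_ceiling * ceiling for s in top)
    clustered = (max(top) - min(top)) <= max_spread  # small spread = no discrimination
    return near_top and clustered

# Hypothetical scores: a saturated benchmark vs. one with headroom.
saturated_scores = [92.1, 91.8, 91.5, 91.2, 90.9, 85.0]
fresh_scores     = [71.3, 64.8, 58.2, 51.0, 44.7, 39.9]

print(is_saturated(saturated_scores))  # True  -> scores bunched near the ceiling
print(is_saturated(fresh_scores))      # False -> plenty of room to tell models apart
```

Both conditions matter: high scores alone just mean the benchmark is easy for today's models, but high scores plus a tiny spread mean it has stopped discriminating.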
Saturated benchmarks create an illusion of parity: they make all top models look equally capable, even though user experience tells a very different story. That's why we need newer, dynamic benchmarks like LiveBench that can keep pace with model improvements and capture what actually matters. In other words, it's a race on both sides: models improve, and benchmarks have to improve with them.