September 21, 2025
As large language models (LLMs) advanced beyond benchmarks like MMLU, researchers faced a problem: traditional test sets were becoming too easy, had sometimes even leaked into training data, and no longer provided a clear signal of progress. LiveBench emerged as a response: a modern benchmark designed to stay relevant in the era of rapidly evolving AI.
Launched in 2024, LiveBench was developed by a consortium of AI researchers who recognized that benchmarks like MMLU were losing their discriminatory power (see previous blog post). Frontier models such as GPT-4 and Claude had reached human-level accuracy on MMLU, and improvements were starting to plateau. Worse, static datasets risked test-set contamination: many benchmark questions had already appeared on the internet, making it possible that models had seen them during training.
LiveBench set out to reset the bar, offering a benchmark that is harder, fresher, and evolving. Its goal was not just to measure knowledge recall, but to test reasoning, analysis, and adaptability across multiple domains.
When first introduced, LiveBench proved to be a shock test for frontier models. On MMLU, top systems like GPT-4, Claude, and Gemini scored 80–90%, leaving little room for separation. But on LiveBench, their scores dropped dramatically, often below 65%, and sometimes much lower on reasoning-heavy tasks.
The benchmark is completely refreshed every six months; the current question set and leaderboard are available on the LiveBench site.
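For readers who want to look at the questions themselves, here is a minimal sketch of pulling one LiveBench category with the Hugging Face `datasets` library. The dataset name and split used below are assumptions about how the releases are organized on the Hub; check the LiveBench page for the authoritative layout before relying on them.

```python
# Minimal sketch: fetch the current LiveBench question set for one category.
# Assumption: questions are published on the Hugging Face Hub under a
# "livebench/<category>" dataset with a "test" split -- verify the actual
# names on the LiveBench page.
from datasets import load_dataset

reasoning = load_dataset("livebench/reasoning", split="test")

print(f"{len(reasoning)} reasoning questions in the current release")
print(reasoning[0])  # inspect one record (prompt, ground truth, release date, ...)
```

Because the benchmark is periodically refreshed, re-running a fetch like this after a release picks up the new question pool, which is what keeps scores comparable only within a given release.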