Every Benchmark Launched 2023-2024 Has Fallen — The METR / SWE-Bench / CORE-Bench / MLE-Bench / PostTrainBench Sequence

TL;DR

All six key AI benchmarks introduced between 2023 and 2024 have either been saturated or are rapidly approaching it, signaling a notable development in AI research capabilities. This pattern suggests that progress in AI may be occurring at a faster rate than previously documented.

According to a recent analysis by Thorsten Meyer, all six prominent AI research benchmarks launched in 2023 and 2024 have either saturated or are on track to saturate within months. The pattern suggests that AI capabilities are advancing faster than expected and prompts a reassessment of earlier timelines for AI development.

Thorsten Meyer’s review of recent data shows that each of the six benchmarks, which measure different facets of AI research ranging from software engineering to model training efficiency, has seen significant progress. For example, SWE-Bench, which assesses real-world software engineering skills, improved from 2% to 93.9% in 30 months, reaching saturation. Similarly, the METR time horizon benchmark, which measures the length of tasks AI systems can complete, rose from 30 seconds in 2022 to 12 hours in 2026, a substantial increase in the duration of work that can be handled autonomously.

Other benchmarks follow the same pattern. CORE-Bench, which tests research reproduction, reached 95.5% in December 2025 and is considered ‘saturated,’ while MLE-Bench, which evaluates end-to-end machine learning engineering, is expected to reach saturation by early 2027. The CPU speedup benchmark, which measures compute efficiency, rose from 2.9× to 52× within 11 months, surpassing human performance baselines.
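To put the growth figures above on a common scale, the sketch below converts them into an implied doubling time. It is a back-of-the-envelope illustration, not part of Meyer's analysis. It assumes smooth exponential growth between the two reported points, and because the article gives only the years 2022 and 2026 for the METR figure, the 48-month span is an assumption. The function name doubling_time_months is hypothetical.

```python
import math


def doubling_time_months(start: float, end: float, months: float) -> float:
    """Months per doubling, assuming smooth exponential growth from start to end."""
    if start <= 0 or end <= start or months <= 0:
        raise ValueError("need 0 < start < end and months > 0")
    return months / math.log2(end / start)


# METR time horizon: 30 seconds -> 12 hours (figures from the article).
# ASSUMPTION: 48 months elapsed; the article states only "2022" and "2026".
metr = doubling_time_months(30, 12 * 3600, 48)

# CPU speedup: 2.9x -> 52x within 11 months (figures from the article).
cpu = doubling_time_months(2.9, 52, 11)

print(f"METR time horizon doubles roughly every {metr:.1f} months")
print(f"CPU speedup doubles roughly every {cpu:.1f} months")
```

Under these assumptions the implied doubling times are roughly 4.6 months for the METR time horizon and 2.6 months for the CPU speedup. A percentage score such as SWE-Bench's is capped at 100%, so a doubling time is not a meaningful summary for it.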

Implications of Rapid Benchmark Saturation for AI Progress

This pattern indicates that AI systems are making substantial progress toward human-level capabilities across multiple domains. The saturation of these benchmarks suggests that further measurable improvements may slow down, or that current measures no longer effectively differentiate levels of performance. For policymakers, investors, and researchers, this development warrants careful consideration of timelines for AI deployment, regulation, and safety measures, as capabilities continue to advance rapidly.
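One way to see why a saturated benchmark may stop differentiating systems is that the remaining gaps shrink to the size of sampling noise. The sketch below is an illustration only: the two scores and the 500-task benchmark size are hypothetical, and it assumes the tasks are independent and the two systems are evaluated as unpaired samples. The function name score_gap_z is likewise hypothetical.

```python
import math


def score_gap_z(p1: float, p2: float, n_tasks: int) -> float:
    """z-score for the gap between two pass rates, each measured on n_tasks tasks.

    Uses the binomial standard error of each pass rate and treats the two
    measurements as independent (an assumption).
    """
    var1 = p1 * (1 - p1) / n_tasks
    var2 = p2 * (1 - p2) / n_tasks
    return abs(p1 - p2) / math.sqrt(var1 + var2)


# HYPOTHETICAL numbers: two systems scoring 94.0% and 95.5% on 500 tasks.
z = score_gap_z(0.94, 0.955, 500)

# A z-score below about 1.96 means the gap is not distinguishable from
# sampling noise at the conventional 95% confidence level.
print(f"z = {z:.2f}, distinguishable at 95%: {z > 1.96}")
```

With these hypothetical inputs the 1.5-point gap does not clear the 95% threshold, so a ranking based on it would be unreliable. A paired comparison on the same tasks would be more sensitive, but the room left near the ceiling stays small either way.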

Background on AI Benchmark Development and Expectations

Since 2023, numerous benchmarks have been introduced to measure AI progress in specific skills such as software engineering, research reproduction, and compute efficiency. Progress was gradual at first, but recent data shows a marked acceleration, with all six benchmarks nearing or reaching saturation within a few years of launch. Analysts such as Jack Clark have noted that this rapid progression supports forecasts of AI capabilities reaching significant milestones by 2028, challenging earlier, more conservative projections.

“Every benchmark launched in 2023-2024 has either saturated or is tracking toward saturation on a timeline of months, not years.”
— Thorsten Meyer

Uncertainties Surrounding Benchmark Saturation and Future Trajectories

While the data indicates rapid saturation, it remains uncertain whether these benchmarks capture the limits of AI capabilities or whether new, more challenging benchmarks will take their place. The implications of saturation for real-world AI deployment, safety, and regulation are also still under evaluation, and some experts caution that a saturated benchmark does not by itself demonstrate comprehensive or safe AI performance at scale.

Next Steps in Monitoring AI Progress and Benchmark Evolution

Researchers and industry analysts will continue to monitor existing benchmarks and develop new ones to assess more complex or nuanced skills. Attention will also focus on how saturation impacts AI deployment timelines, safety assessments, and policy frameworks. Further data collection and analysis are expected in the coming months to determine whether saturation indicates a plateau or a new phase of ongoing advancement.

Key Questions

What does it mean that all benchmarks have saturated?

Saturation indicates that AI systems have achieved near-maximum performance on these specific tests, suggesting substantial progress in these areas. However, it does not necessarily mean that AI capabilities have plateaued across all domains.

Does benchmark saturation mean AI development is complete?

No. Benchmarks measure specific skills or tasks. Saturation shows progress in those areas but does not imply that all aspects of AI development or deployment are finished or safe.

Are new benchmarks likely to be more challenging?

Yes. As current benchmarks reach saturation, researchers are expected to create more complex tests that can still register ongoing improvements in AI capabilities.

How does this impact AI safety and regulation?

Rapid saturation suggests that AI capabilities are advancing quickly, which underscores the importance of ongoing safety, control, and regulatory measures. Policymakers will need to adapt to these developments accordingly.

When can we expect the next major milestone?

Based on current trends, significant advancements are anticipated to continue into 2027, though the exact timing of breakthroughs may vary due to emerging challenges and the development of new benchmarks.

Source: ThorstenMeyerAI.com

Author

1023 Jack Team
