The Plateau Hits: ARC-AGI-1 Saturated at 95%—Is Tool Building Lagging True AI Potential?
Saturation Point: ARC-AGI-1 Performance Peaks
The frontier of artificial general intelligence benchmarking has hit a curious plateau, signaled by performance metrics shared by @gregkamradt on February 12, 2026. Specifically, the ARC-AGI-1 benchmark, a crucial measuring stick for general learning capabilities, has effectively saturated, with top models now scoring above 95%. This is a victory of sorts: the current iteration of the benchmark is being mastered by state-of-the-art systems.
This high level of completion, while commendable for the models tested, raises an immediate question: what is the next significant milestone? A perfect 100% score is now the critical target, the point at which current systems can solve every task in the ARC-AGI-1 suite. Until that threshold is crossed, the benchmark remains technically unsolved, even as its power to discriminate between the very best performers diminishes.
However, this saturation does not render ARC-AGI-1 obsolete. As @gregkamradt noted, the metric's utility is shifting from pure capability measurement to efficiency monitoring. With operational cost a central concern in scaling frontier models, ARC-AGI-1 will play an ongoing, vital role in tracking intelligence per watt. Future comparisons will look not just at how high a model scores, but at how affordably it can approach the ceiling, driving optimization of the energy spent on complex reasoning.
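As a rough illustration of this shift toward efficiency monitoring, the sketch below tracks the cheapest cost per task among runs that clear the saturation ceiling. The record format, the threshold choice, and all entries other than the Gemini 3 Deep Think result cited later in this report are assumptions for illustration, not ARC Prize methodology.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ArcResult:
    when: date
    model: str
    score_pct: float      # ARC-AGI-1 score, 0-100
    cost_per_task: float  # USD per task

# One real data point (the Gemini 3 Deep Think result cited later in this
# report) plus hypothetical entries to show how tracking would work over time.
results = [
    ArcResult(date(2026, 2, 1), "Gemini 3 Deep Think", 96.0, 7.17),
    ArcResult(date(2026, 6, 1), "hypothetical-model-a", 95.4, 2.10),
    ArcResult(date(2026, 11, 1), "hypothetical-model-b", 96.5, 0.85),
]

CEILING = 95.0  # the saturation threshold discussed above

# "Efficiency monitoring": among runs that clear the ceiling, report the
# cheapest cost per task seen so far.
qualifying = [r for r in results if r.score_pct >= CEILING]
cheapest = min(qualifying, key=lambda r: r.cost_per_task)
print(f"Cheapest run at >= {CEILING}%: {cheapest.model} "
      f"(${cheapest.cost_per_task}/task, {cheapest.when})")
```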
Research Hypotheses for the Next 12 Months
Building on this performance ceiling, @gregkamradt put forward several provocative hypotheses about the near-term trajectory of AI research and deployment. These predictions offer a useful stress test for the pace of innovation in both pure model scaling and practical application engineering.
One immediate litmus test involves verification timelines. The expectation is that other major AI labs will follow the leading edge, with the hypothesis forecasting that multiple labs will achieve verified scores above 95% on ARC-AGI-1 before May 2026. This suggests a relatively tight distribution of peak performance around the current SOTA models on this measure.
More controversially, a prediction was made about the economics of intelligence: while performance is rising, the cost floor may be sticky. @gregkamradt forecasts that cost per task will not fall below $0.013 until June 2027, a reduction of well over two orders of magnitude from today's cited state of the art. This implies that breakthroughs in computational efficiency or algorithmic optimization capable of drastically slashing deployment costs are likely more than a year away, constraining rapid democratization even as capability rises.
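To make the scale of that prediction concrete, the short calculation below, a minimal sketch assuming the $7.17/task state-of-the-art figure cited later in this report as the baseline, derives the implied reduction factor.

```python
import math

# Figures: the $7.17/task Gemini 3 Deep Think result cited below is taken as
# the assumed starting point; $0.013/task is the threshold in the prediction.
current_cost = 7.17
target_cost = 0.013

reduction_factor = current_cost / target_cost        # roughly 550x cheaper
orders_of_magnitude = math.log10(reduction_factor)   # roughly 2.7 orders of magnitude

print(f"Required reduction: {reduction_factor:.0f}x "
      f"(~{orders_of_magnitude:.1f} orders of magnitude)")
```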
Perhaps the most insightful hypothesis concerns the utilization gap. The capability latent in current foundation models, the "potential energy," is immense, yet the industry's practical deployment mechanisms are clearly lagging: the estimated utilization of current model performance via industry tools is only around 5%. This highlights a major engineering challenge, namely translating raw capability into deployable, useful performance. To bridge this chasm, roughly two more "OpenClaw"-style breakthroughs, analogous to major leaps in tool integration or environment interaction, are expected within the next 12 months.
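One way to read that 5% figure, offered here purely as an illustrative definition rather than one given in the source, is as a ratio of the capability users actually obtain through tools to the capability the raw model demonstrates on benchmarks.

```python
# Illustrative only: the thread gives a ~5% utilization estimate but no formula.
# Here utilization is treated as delivered capability over benchmarked capability.
def utilization(delivered_capability: float, benchmarked_capability: float) -> float:
    """Fraction of benchmarked capability actually realized in deployed tools."""
    return delivered_capability / benchmarked_capability

# Hypothetical numbers chosen so the ratio matches the ~5% estimate.
print(f"Utilization: {utilization(4.8, 96.0):.0%}")   # -> 5%
```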
Evolving Landscape of AI Benchmarks
The saturation of ARC-AGI-1 forces the wider benchmarking community to recalibrate its approach. Over the next 12 to 24 months, benchmark development appears to be bifurcating along two distinct, yet equally necessary, strategic paths for advancing the field beyond current mastery.
The first path involves introducing substantially harder problems. This route focuses on complexity scaling, exemplified by proposals like "Frontier 5" or "HLE++" (Humanity's Last Exam derivatives). These aim to push the limits of current architectures by demanding deeper abstraction, multi-step reasoning, or novel compositional skills that 95% performance on ARC-AGI-1 does not yet require.
The counter-strategy focuses on specialization within niche environments. This involves creating highly detailed, domain-specific challenges, drawing inspiration from projects like "TerminalBench derivatives." By focusing on areas where human expertise is deep but current AI tools struggle—such as complex interactive system management or novel scientific discovery workflows—these benchmarks probe for specialized, robust general intelligence rather than just broad coverage.
For the ARC Prize itself, the focus remains rigidly aligned with the pursuit of AGI: defining and measuring "what humans can do, but AI currently cannot." This distinction is crucial. While general benchmarks push the objective performance envelope, ARC's goal is to isolate the remaining qualitative gaps in general learning. The fixed goalpost remains the measurement of learning efficiency: the speed and depth with which AI can master new domains compared to human experts.
The ARC-AGI Series and the Path to AGI
The development of the ARC-AGI series is not a static pursuit; it is an evolving methodology designed to keep pace with rapid advancements. Each subsequent version—ARC-AGI-2, ARC-AGI-3, and beyond—is specifically engineered to improve the measurement fidelity of increasingly complex learning mechanisms that emerge in frontier models.
The theoretical endpoint of this process defines the AGI threshold: the moment the problem-creation pipeline can no longer produce tasks that humans can solve but current AI methods cannot. On this view, AGI is not a single score; it is the point at which this well of human-solvable, AI-unsolvable problems runs dry and ceases to yield new insight into general intelligence.
This entire endeavor serves two critical organizational North Stars for the entities driving this research, as articulated by @gregkamradt. First, it must inspire the next set of frontier open research, providing clear, challenging targets for the global community. Second, it must guide public sense-making, clarifying where the true technological frontier lies and grounding abstract discussions in measurable realities. Ultimately, the paramount metrics are the generation of net new science and the acceleration of open progress.
Reassessing "Difficulty" in Benchmarking
A common misconception surrounding benchmarks like ARC-AGI is to equate high performance scores with "hardness." @gregkamradt explicitly refutes this framing. ARC-AGI is not defined by its inherent hardness for the model, but by its objective accuracy in reflecting the core mechanism of intelligence—the ability to generalize from few examples.
If a model achieves 98%, it means the benchmark accurately reflected 98% of general intelligence capacity as understood today. The perceived difficulty, therefore, is not a measure of the benchmark's design challenge, but rather an emergent property resulting directly from the unsolved nature of general intelligence construction. As intelligence construction becomes easier, the benchmark appears "easier," even if the underlying tasks remain conceptually complex.
Current Performance Snapshot: ARC-AGI State-of-the-Art
To solidify the current state of play, the update included a direct citation of recent, cutting-edge results:
| Model / Evaluation | Performance | Cost per Task |
|---|---|---|
| Gemini 3 Deep Think (ARC-AGI-1) | 96.0% | $7.17 |
| Gemini 3 Deep Think (ARC-AGI-2) | 84.6% | $13.62 |
These figures, based on February 2026 evaluations, powerfully illustrate the dichotomy: near-mastery on the older metric (v1) but significantly more room for growth on the newer, more demanding v2 evaluation. The journey continues, driven by the need to measure intelligence that is not just high-scoring, but truly general.
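For readers who want to compare the two evaluations on cost efficiency, the sketch below derives a dollars-per-percentage-point figure from the cited numbers; the metric itself is an illustrative choice, not one used by the ARC Prize.

```python
# Figures from the February 2026 snapshot cited above.
results = {
    "ARC-AGI-1": {"score_pct": 96.0, "cost_per_task": 7.17},
    "ARC-AGI-2": {"score_pct": 84.6, "cost_per_task": 13.62},
}

for name, r in results.items():
    # Dollars spent per percentage point scored: a rough, illustrative
    # efficiency measure (lower is better), not an official ARC Prize metric.
    dollars_per_point = r["cost_per_task"] / r["score_pct"]
    print(f"{name}: {r['score_pct']}% at ${r['cost_per_task']}/task "
          f"-> ${dollars_per_point:.3f} per point")
```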
Source
This report is based on updates shared by @gregkamradt on X. We've synthesized the core insights to keep you ahead of the curve.
