ARC-AGI Ceiling Hit? Anthropic's Opus Taps Out Budget, Effort Levels Stall Performance

Antriksh Tewari · 2/11/2026 · 2-5 min read
Anthropic's Opus hits an ARC-AGI ceiling: a fixed thinking budget stalls scores across effort levels. See the new SOTA numbers and why 'low' effort suffices.

Anthropic's Opus Achieves New SOTA on ARC-AGI, But Performance Plateau Signals Budget Saturation

The ongoing race for artificial general intelligence benchmarks has seen another significant milestone crossed, albeit one that raises more questions about methodology than it answers about ultimate capability. On Feb 5, 2026, at 6:55 PM UTC, Greg Kamradt (@gregkamradt) of the ARC Prize Foundation highlighted performance data shared by Anthropic regarding their Claude Opus model's latest run on the rigorous Abstraction and Reasoning Corpus (ARC-AGI) challenge.

The results confirmed that Opus has established a new State-of-the-Art (SOTA) performance level on the evaluation suite. However, the analysis quickly shifted from celebration to deep technical scrutiny when the impact of varying computational effort levels was examined.

Specific Performance Metrics at Maximum Effort

The quoted results from Anthropic’s testing, performed under a fixed 120K-token thinking budget, demonstrated impressive peak performance (an aggregation sketch follows the list):

  • ARC-AGI-1 (Max Effort): Achieved 93.0% accuracy at a cost of $1.88 per task.
  • ARC-AGI-2 (Max Effort): Reached 68.8% accuracy, costing $3.64 per task.
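Both figures are per-task means over the evaluation set. A minimal sketch of how such aggregates are computed (the record fields here are illustrative, not Anthropic's actual output format):

```python
# How the reported aggregates are computed: simple means over per-task
# results. Record fields are illustrative, not Anthropic's format.
from dataclasses import dataclass

@dataclass
class TaskResult:
    solved: bool      # did the model's output match the ground-truth grid?
    cost_usd: float   # API spend attributed to this task

def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (accuracy in %, mean cost per task in USD)."""
    accuracy = 100.0 * sum(r.solved for r in results) / len(results)
    mean_cost = sum(r.cost_usd for r in results) / len(results)
    return accuracy, mean_cost

# A run matching the ARC-AGI-1 max-effort line would aggregate to ~(93.0, 1.88).
```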

While these figures represent the current apex of measured capability on the benchmark, what @gregkamradt framed as the real puzzle was the surprising lack of performance variance across the effort settings tested (low, medium, high, and max). This convergence suggested that the computational dials were failing to influence the ultimate outcome.

Analysis of Effort Level Performance Discrepancies

The core of the investigation centered on understanding why spending more computational resources—labeled as ‘effort’—did not translate into proportionate performance gains on the ARC-AGI tasks.

The Experimental Setup: Fixed Context, Variable Effort

The experimental design was straightforward: researchers fixed the model's maximum available internal reflection space, the thinking budget, at 120,000 tokens across all tests, then deliberately varied the 'effort' setting applied to Opus across low, medium, high, and maximum configurations.

The observable finding was remarkably counter-intuitive for resource allocation optimization:

  • Performance scores remained highly similar regardless of the assigned effort setting.
  • The key finding was that for the specific demands of ARC-AGI tasks, the 'low' effort setting proved sufficient to reveal the model's actual ceiling; cranking the effort to 'max' provided no meaningful benefit (a sketch of such a sweep follows this list).
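A minimal sketch of what such a sweep might look like against the Anthropic Messages API is below. The `thinking.budget_tokens` option is Anthropic's documented extended-thinking control; the `effort` field, the model id, and the token headroom are assumptions, since the post does not describe the actual harness:

```python
# Sketch of the described sweep: thinking budget held at 120K tokens while
# the effort setting varies per run. Harness details are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

THINKING_BUDGET = 120_000                         # fixed across all runs
EFFORT_LEVELS = ["low", "medium", "high", "max"]  # labels from the post

def run_task(task_prompt: str, effort: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",             # assumed model id
        max_tokens=THINKING_BUDGET + 8_000,  # headroom for the final answer;
                                             # assumes outputs this long are allowed
        thinking={"type": "enabled", "budget_tokens": THINKING_BUDGET},
        extra_body={"effort": effort},       # hypothetical effort control
        messages=[{"role": "user", "content": task_prompt}],
    )
    # Keep only the final text block; thinking blocks precede it.
    return next(b.text for b in response.content if b.type == "text")

answers = {lvl: run_task("<serialized ARC-AGI task>", lvl) for lvl in EFFORT_LEVELS}
```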

The Implication: Ceiling Hit at Low Effort

If 'low' effort delivers the same result as 'max' effort, it strongly implies either that the task design inherently constrains the model or that the provided computational resources are exhausted regardless of the effort setting. This moves the bottleneck away from the efficiency of the computation and toward the capacity of the provided thinking space.

Understanding the Saturation Hypothesis

When faced with a flat performance curve across increasing computational investment, researchers must look for a hard limit being hit upstream. In this case, Anthropic proposed a compelling, if restrictive, hypothesis rooted in the model’s internal resource consumption.

The 120K Token Thinking Budget Constraint

Anthropic’s central hypothesis posits that the inherent complexity of the ARC-AGI tasks themselves forces the model to utilize the entirety of its allocated thinking budget—the full 120K tokens—even when the system is nominally set to the 'low' effort level.

This suggests a critical implication:

  • The primary limiting factor is not the computational effort expended, but the resource ceiling itself—the 120K thinking limit.

If a complex problem requires, say, 125K tokens of deep, iterative reasoning to fully solve, but the model is capped at 120K, then telling the model to think less (low effort) or think more (high effort) becomes moot: at any setting, the task demands the full 120K capacity just to attempt a solution.

Because all effort levels likely hit this hard computational ceiling, the resulting performance scores naturally converge to the upper bound dictated by the budget, leaving negligible gains between the low/med/high/max settings. The ceiling isn't the model's innate intelligence; it might be the allocated scratchpad size.
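The hypothesis reduces to a toy model: if every effort level's desired token spend on these tasks exceeds the 120K cap, effective thinking is clipped to the cap and scores converge. A minimal illustration, with all numbers invented for demonstration:

```python
# Toy model of the saturation hypothesis. All numbers are illustrative.
BUDGET = 120_000  # hard thinking-token cap

# Hypothetical tokens each effort level *wants* to spend on a hard ARC task.
demand = {"low": 125_000, "medium": 160_000, "high": 220_000, "max": 300_000}

def score(tokens_spent: int) -> float:
    """Stand-in accuracy curve: more thinking helps, with diminishing returns."""
    return 100 * (1 - 0.5 ** (tokens_spent / 60_000))

for level, want in demand.items():
    spent = min(want, BUDGET)  # the cap clips every level to the same value
    print(f"{level:>6}: wants {want:>7,} tokens, spends {spent:,}, "
          f"score {score(spent):.1f}%")
# Every effort level spends exactly 120,000 tokens, so scores are identical.
```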

Implications for Future Model Scaling and ARC Evaluation

The saturation discovered by Opus on the ARC-AGI benchmark, under the constraints of a fixed 120K token thinking budget, carries profound implications for how the AI community measures progress.

Benchmarks Testing the Budget, Not the Brain

If a highly capable model like Opus is spending its entire thinking allocation just to reach a performance plateau, it suggests that the ARC-AGI benchmark, under this specific 120K token budget, may no longer serve as an effective differentiator between increasingly capable models operating near this resource limit. The benchmark becomes a test of how well a model utilizes 120K tokens, rather than a pure test of its generalized reasoning potential.

  • Question for Researchers: How much higher is Opus capable of scoring if it were permitted, say, 250K tokens for internal deliberation on these same tasks?
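The question is directly testable if per-task thinking-token usage is logged: under the saturation hypothesis, most tasks should pin at or near the cap even at low effort. A hedged sketch, assuming the harness records a thinking-token count per task:

```python
# Diagnostic sketch: if the saturation hypothesis holds, most tasks should
# spend (nearly) the full budget regardless of effort. Field names assumed.
THINKING_BUDGET = 120_000

def saturation_rate(thinking_tokens: list[int], tolerance: float = 0.99) -> float:
    """Fraction of tasks that used >= 99% of the thinking budget."""
    hits = sum(t >= tolerance * THINKING_BUDGET for t in thinking_tokens)
    return hits / len(thinking_tokens)

# Usage: per-task token counts collected from a 'low'-effort run (illustrative).
low_effort_usage = [119_800, 120_000, 118_950, 120_000]
print(f"saturation rate: {saturation_rate(low_effort_usage):.0%}")  # -> 100%
```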

The Path Forward: Scaling Context or Changing Architecture

This finding points toward two necessary avenues for achieving further measurable progress on this specific flavor of abstract reasoning:

  1. Increased Thinking Budgets: To truly test the upper limits of Opus’s capabilities beyond the current saturation point, future evaluations on ARC-AGI might require significantly increased thinking token limits, allowing the model to explore deeper reasoning paths unconstrained by the existing 120K wall (a budget-sweep sketch follows this list).
  2. Architectural Bypass: Alternatively, if budgets cannot be endlessly scaled, developers must focus on architectural breakthroughs that enable models to achieve higher scores with less reliance on sheer sequential token processing, perhaps by improving internal scaffolding or optimizing task decomposition before the reasoning phase begins.
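On the first avenue, the follow-up experiment is easy to describe: re-run the same task set at increasing budgets and watch where the score curve flattens. A sketch reusing the toy scoring model from earlier (all numbers illustrative):

```python
# Budget-scaling sweep: if performance keeps rising past 120K, the earlier
# plateau was a budget ceiling, not a capability ceiling. Numbers illustrative.
def score(tokens_spent: int) -> float:
    """Stand-in accuracy curve with diminishing returns (same toy as earlier)."""
    return 100 * (1 - 0.5 ** (tokens_spent / 60_000))

TASK_DEMAND = 250_000  # hypothetical tokens a hard task would ideally spend

for budget in (120_000, 180_000, 250_000, 400_000):
    spent = min(TASK_DEMAND, budget)
    print(f"budget {budget:>7,}: score {score(spent):.1f}%")
# Scores rising up to ~250K and flat beyond it would localize the true ceiling.
```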

The achievement of a new SOTA is notable, but the revelation that Opus has likely hit a budget ceiling rather than an intelligence ceiling provides a critical inflection point for how future large models will be tested and scaled in pursuit of true generalized intelligence.


Source: Data and analysis derived from the report shared by @gregkamradt on Feb 5, 2026: https://x.com/gregkamradt/status/2019485110291337681
