ARC-AGI-3 Benchmark Launch and Performance
TECH

Strategic Overview

  • 01.
    ARC-AGI-3, launched March 25, 2026, is the first interactive reasoning benchmark for AI agents: 135 turn-based game environments with approximately 1,000 levels, all solvable by untrained humans, yet every frontier model tested scores below 1%. The official announcement from @arcprize on X.com described it as "the only unsaturated agentic intelligence benchmark in the world," noting that it "tests how [models] learn" rather than "what models already know"; the post drew over 4,300 engagements within its first day.
  • 02.
    Top frontier model scores are Gemini 3.1 Pro Preview at 0.37%, GPT-5.4 at 0.26%, Claude Opus 4.6 at 0.25%, and Grok 4.2 at 0.00%, against a 100% human solve rate. These scores went viral on X.com, with @scaling01's breakdown garnering over 3,300 engagements. In sharp contrast, the community-built StochasticGoose system scored 12.58% in the 30-day preview phase using a CNN combined with structured search, more than 30 times higher than any frontier LLM, suggesting that fundamentally different architectures may hold the key to progress.
  • 03.
    ARC Prize 2026 offers over $2 million across three parallel competition tracks. Submissions close November 2, 2026, results are announced December 4, 2026, and all solutions must be open-sourced under MIT or CC0 licenses.
  • 04.
    Duke University testing revealed that Opus 4.6 scored 97.1% on known tasks with hand-crafted harnesses but dropped to 0% on unfamiliar environments, illustrating that current AI capability relies on pre-existing knowledge rather than genuine skill acquisition.

Why This Matters

ARC-AGI-3 arrives at a moment when the AI industry faces a credibility crisis in benchmarking. Over the past two years, frontier models have rapidly saturated one benchmark after another — from MMLU to HumanEval to ARC-AGI-1 itself. Each time a benchmark is conquered, it is quickly dismissed as insufficiently challenging, and the goalposts shift. ARC-AGI-3 represents the most deliberate attempt yet to create a benchmark that cannot be gamed through memorization, scale, or clever prompting. The fact that every frontier model scores below 1% while untrained humans achieve 100% is not merely an embarrassing number — it is a structural claim about what current AI systems fundamentally cannot do.

The stakes extend well beyond academic measurement. As Mike Knoop reports, frontier labs are paying far more attention to this version than its predecessors. The social response on X.com confirms the benchmark has captured the AI community's attention: the official @arcprize announcement — framing ARC-AGI-3 as the test of "how [models] learn" rather than "what models already know" — generated over 4,300 engagements (3,500 likes, 691 retweets, 188 replies) within the first day. Meanwhile, @scaling01's post breaking down individual model scores (Gemini 3.1 Pro at 0.37%, GPT-5.4 at 0.26%, Opus 4.6 at 0.25%, Grok 4.2 at 0%) attracted over 3,300 engagements, indicating intense public interest in the concrete performance gap. This level of social signal, combined with the $2 million prize pool and open-source requirement, suggests ARC-AGI-3 has achieved something rare: genuine buy-in from both the research community and the organizations whose systems it critiques.

How It Works

ARC-AGI-3 fundamentally changes the testing paradigm from static puzzle-solving to interactive exploration. Each of the 135 environments is a turn-based game with its own internal logic, no instructions, no descriptions, and no stated win conditions. As described officially, "the agent sees a visual state, takes an action, sees the result, and must figure out what it's trying to do on the fly." This design makes it impossible to succeed through pattern matching against training data — the agent must genuinely learn and adapt in real time.

The scoring mechanism uses RHAE (Relative Human Action Efficiency) with a squared penalty: the formula is (human actions / AI actions) squared. This means an AI that takes 100 actions where a human takes 10 receives only a 1% score, not 10%. The squaring ensures that brute-force exploration strategies are heavily penalized, rewarding systems that can efficiently reason about novel environments rather than exhaustively trying every possibility. Human baseline data was collected from over 1,200 players across 3,900+ games, with the second-best human tester's action count used as the baseline to avoid outlier influence. A critical design change from previous versions was highlighted by ARC Prize co-founder Mike Knoop, as reported by @tbpn on X.com: ARC-AGI-3 includes far fewer public demonstration games than ARC-AGI-1 or 2, forcing models to generalize to the larger private evaluation set rather than pattern-match on available examples.

By The Numbers

[Chart: ARC-AGI-3 benchmark scores: human baseline 100% vs. frontier AI models all under 1%]

The performance gap between humans and AI on ARC-AGI-3 is the largest of any major benchmark in recent memory. Gemini 3.1 Pro Preview leads frontier models at 0.37%, followed by GPT-5.4 at 0.26%, Claude Opus 4.6 at 0.25%, and Grok 4.2 at 0.00%. Humans solve 100% of the environments. These scores rapidly circulated on X.com, with @scaling01's detailed breakdown generating over 3,300 engagements, reflecting the AI community's fascination with the sheer scale of the human-AI gap.

The most significant result from the preview phase may not be a frontier model at all. The community-built StochasticGoose system scored 12.58% using a CNN combined with structured search, completing 18 levels. This is over 30 times higher than any frontier LLM and represents the standout data point of the benchmark's early life: a non-LLM hybrid approach dramatically outperforming the most capable language models in the world. The Duke University finding about Claude Opus 4.6 reinforces this story: with hand-crafted harnesses on known task types, it scored 97.1%, but dropped to 0% on unfamiliar environments. This 97-point swing demonstrates that current AI capability is almost entirely dependent on encountering patterns similar to training data. For historical context, it took four years for frontier models to go from 0% (GPT-3 in June 2020) to 5% (GPT-4o in 2024) on the much simpler ARC-AGI-1.

Impacts & What's Next

The immediate impact of ARC-AGI-3 is a recalibration of expectations around AI capabilities. While frontier models have achieved impressive scores on tasks involving language understanding, code generation, and mathematical reasoning, ARC-AGI-3 exposes a categorical weakness: the inability to acquire new skills in unfamiliar environments without prior training data. This finding supports François Chollet's long-standing argument that current AI systems are sophisticated pattern matchers rather than genuine general intelligences.

The competition timeline — submissions closing November 2, 2026 with results on December 4 — gives the research community roughly eight months to develop novel approaches. The open-source requirement under MIT or CC0 licenses means that any progress will compound: each team's innovations become building blocks for others. Early signs from the preview period suggest that hybrid approaches combining visual processing (CNNs) with structured search may outperform pure language model reasoning — StochasticGoose's 12.58% versus the best frontier LLM's 0.37% is a striking 34x difference. The social conversation on X.com is concentrated among AI researchers and tech commentators, with high-engagement posts from @arcprize (4,300+ engagements) and @scaling01 (3,300+ engagements) dominating the discourse. Notably, no significant Reddit discussion or YouTube coverage has emerged yet — likely because the benchmark launched just one day ago — but the intensity of the X.com response suggests broader platform coverage will follow.

The Bigger Picture

NYU Professor Saining Xie's characterization of LLMs as "anti-Bitter Lesson" captures the deeper theoretical tension ARC-AGI-3 exposes. The Bitter Lesson, articulated by Rich Sutton, argues that general methods leveraging computation ultimately outperform methods that leverage human knowledge. Yet current LLMs succeed precisely because they compress vast quantities of human-generated text — they are, in a sense, the most sophisticated knowledge retrieval systems ever built rather than systems that learn from raw experience. ARC-AGI-3 tests for the latter capability and finds it almost entirely absent.

This raises a fundamental question about the path to artificial general intelligence: can the current paradigm of large language models, even with reasoning chains and tool use, ever develop genuine skill acquisition? Or does AGI require fundamentally different architectures — perhaps ones that learn more like biological systems, through interaction with environments rather than consumption of text? The StochasticGoose result — 12.58% from a CNN plus structured search, compared to sub-1% for every frontier LLM — hints at one direction: combining visual understanding with systematic exploration already outperforms pure LLM reasoning by an order of magnitude. As @arcprize stated in their launch announcement, the benchmark tests "how [models] learn" — and on that measure, the world's most capable AI systems have been found profoundly wanting.

Historical Context

2019-11-01
François Chollet published 'On the Measure of Intelligence,' introducing the original ARC benchmark.
2020-06-01
GPT-3, released June 2020, scored 0% on ARC-AGI-1, establishing a baseline that would take four years for frontier models to meaningfully improve upon.
2024-12-01
OpenAI's o3 achieved a breakthrough score on ARC-AGI-1, beginning the saturation of the first-generation benchmark.
2025-03-24
ARC-AGI-2 launched; pure LLMs scored 0% and reasoning systems achieved only single-digit percentages.
2026-01-01
ARC-AGI-2 top solutions reached 54% (Poetiq on Gemini 3 Pro at $30/task), while ARC-AGI-1 became saturated at 85%+.
2026-03-25
ARC-AGI-3 launched at Y Combinator, introducing 135 interactive game environments where all frontier models score below 1%.

Power Map

Key Players
ARC Prize Foundation
Creator and organizer of ARC-AGI-3 and ARC Prize 2026, administering the $2M+ prize pool and defining benchmark methodology.

François Chollet
Creator of the original ARC benchmark (2019) and co-founder of the ARC Prize Foundation; chief advocate for the position that current AI lacks general intelligence.

Mike Knoop
Co-founder of the ARC Prize Foundation; reports that frontier labs are paying far more attention to V3 than earlier versions.

OpenAI
Developer of GPT-5.4, which scored 0.26% on ARC-AGI-3; previously achieved a breakthrough on ARC-AGI-1 with o3 in late 2024.

Google DeepMind
Developer of Gemini 3.1 Pro Preview, which scored 0.37%, the highest among frontier models on ARC-AGI-3.

Anthropic
Developer of Claude Opus 4.6, which scored 0.25% on ARC-AGI-3; Duke University testing showed 97.1% with scaffolding on known tasks versus 0% on unfamiliar ones.

THE SIGNAL.

Analysts

"Chollet argues that current AI systems lack general intelligence, asserting that "the scaffolding is the human intelligence" — meaning that when models appear capable on known tasks, it is the human-designed structure around them doing the cognitive work, not the model itself."

François Chollet
Creator of ARC, Co-founder of ARC Prize Foundation

"Konwinski endorsed ARC-AGI-3 as a benchmark that "goes at the heart of this gap that exists between actually measuring for AGI and the standard set of benchmark suites that the big labs and essentially everybody seems to use in the rat race of getting 0.5% of improvement over every other state-of-the-art model.""

Andy Konwinski
Laude Institute

"Xie characterizes LLMs as "anti-Bitter Lesson" because they rely entirely on human-generated knowledge rather than learning from raw experience."

Saining Xie
Professor, NYU

"Knoop reports that "frontier labs are paying far more attention to V3 than they did to earlier versions," indicating major AI companies view ARC-AGI-3 as a significant challenge. On X.com, @tbpn reported that Knoop identified a key design change: far fewer public demonstration games than previous versions, forcing models to generalize to the larger private set rather than pattern-match on examples."

Mike Knoop
Co-founder, ARC Prize Foundation
The Crowd

"Announcing ARC-AGI-3. The only unsaturated agentic intelligence benchmark in the world. Humans score 100%, AI <1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know, ARC-AGI-3 tests how they learn"

@arcprize

"ARC-AGI-3 scores for GPT-5.4, Gemini 3.1 Pro and Opus 4.6. Gemini 3.1 Pro: 0.37%. GPT-5.4: 0.26%. Opus 4.6: 0.25%. Grok 4.2: 0%"

@scaling01

"ARC Prize cofounder @mikeknoop says the biggest difference between ARC-AGI-3 and 1 and 2 is: far fewer public demonstration games, forcing models to generalize to the larger private set."

@tbpn