TECH

AA-AgentPerf benchmark and NVIDIA Blackwell efficiency

Strategic Overview

01.
Artificial Analysis released AA-AgentPerf, its first agentic inference hardware benchmark, which replays real coding-agent trajectories and measures how many concurrent agents a system can serve while meeting production service-level targets.
02.
The lead metric is Agents per Megawatt, the maximum number of agents an accelerator platform can serve per megawatt of power, positioned as the metric that matters most in a power-constrained AI buildout.
03.
NVIDIA's GB300 NVL72 led the first results on DeepSeek V4 Pro, running up to 20x more agents per megawatt than the prior-generation H200, and ahead of AMD's MI355X.
04.
The workloads use real coding-agent runs of up to 200 turns with sequence lengths beyond 100K tokens across 12+ programming languages, and the benchmark allows production optimizations such as KV cache reuse, speculative decoding, and disaggregated prefill/decode.

Deep Analysis

Why 'Agents per Megawatt' is the metric that actually matters

Most inference benchmarks report tokens per second or cost per million tokens. AA-AgentPerf sets those aside for a deliberately different headline number: Agents per Megawatt, the maximum number of concurrent agents an accelerator platform can serve for each megawatt of power it draws ^[1]. The reasoning is that AI data centers are now bottlenecked by power availability, not rack space or capital, so the right question for a buyer is no longer 'how fast' but 'how many agents can I run inside my power envelope.' What makes the benchmark credible is the workload underneath it: rather than synthetic single-turn prompts, it replays real coding-agent trajectories of up to 200 turns with sequence lengths beyond 100K tokens across 12+ programming languages, scored against multiple service-level tiers ^[1]. Crucially, it is the first inference benchmark to allow the production optimizations labs actually run: KV cache reuse (reusing already-computed attention state instead of recomputing it), speculative decoding (a small draft model proposes tokens that a large model verifies in bulk), and disaggregated prefill/decode (splitting the prompt-ingestion and token-generation phases onto different hardware) ^[1]. That makes the results a measure of deployable efficiency, not lab-bench throughput.
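To make the metric concrete, here is a minimal sketch of how an Agents per Megawatt figure could be derived from load-test measurements. It is an illustration under stated assumptions, not Artificial Analysis's published methodology: the `LoadPoint` fields, the function name, and every number in the example are hypothetical. Only the tier thresholds (20 tokens/s, 10 s time-to-first-token) come from the reported easiest SLO tier.

```python
from dataclasses import dataclass


@dataclass
class LoadPoint:
    """One measured operating point. All fields are hypothetical inputs."""
    concurrent_agents: int  # agents served at once during the run
    tokens_per_s: float     # per-agent output speed observed
    ttft_s: float           # time to first token observed, in seconds
    power_kw: float         # system power draw during the run, in kilowatts


def agents_per_megawatt(points, min_tokens_per_s, max_ttft_s):
    """Best agents/MW among operating points that meet the SLO tier.

    Returns 0.0 if no operating point meets the tier.
    """
    best = 0.0
    for p in points:
        meets_slo = p.tokens_per_s >= min_tokens_per_s and p.ttft_s <= max_ttft_s
        if meets_slo:
            megawatts = p.power_kw / 1000.0
            best = max(best, p.concurrent_agents / megawatts)
    return best


# Made-up sweep: pushing concurrency higher eventually breaks the SLO.
sweep = [
    LoadPoint(concurrent_agents=64, tokens_per_s=35.0, ttft_s=2.0, power_kw=10.0),
    LoadPoint(concurrent_agents=128, tokens_per_s=24.0, ttft_s=6.0, power_kw=11.0),
    LoadPoint(concurrent_agents=256, tokens_per_s=14.0, ttft_s=15.0, power_kw=12.0),
]
result = agents_per_megawatt(sweep, min_tokens_per_s=20.0, max_ttft_s=10.0)
print(f"{result:.0f} agents/MW")
```

In this toy sweep the 256-agent point is discarded because it misses both thresholds, so the score comes from the 128-agent point. The same structure explains why a stricter tier can only lower or hold the figure: fewer operating points qualify.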

The architectural reason rack-scale Blackwell wins: disaggregation

NVIDIA's GB300 NVL72 led the launch results, and the mechanism behind its lead is structural, not just a faster chip. The GB300 NVL72 links 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single NVLink fabric with roughly 130 TB/s of bandwidth and ~20-21 TB of HBM3e, presented as one liquid-cooled rack ^[4]. Because the whole rack behaves as one large memory-coherent pool, inference can be aggressively disaggregated: prefill and decode are placed on separate GPUs, and the large KV cache is shared across the fabric. Artificial Analysis says this yields clear gains in both raw compute and Agents per Megawatt versus single-node deployments ^[1]. The numbers bear this out: the rack-scale GB300 NVL72 is roughly 3x more power-efficient than a single-node B300 (~61,354 vs ~21,053 Agents/MW on the easiest SLO tier) and up to 20x more efficient than the previous-generation H200 ^[1]^[2]. In other words, the win is not only a generational chip upgrade from Hopper to Blackwell; it is the fabric that lets a long-context agent's working state spread across 72 GPUs without leaving the rack.
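Why a shared KV cache matters so much for long agent runs can be seen with simple token accounting. The sketch below is an idealized model, not a description of any vendor's serving stack: it assumes every turn's prefix is found in cache (no eviction, no cache misses), and the function name and the 500-tokens-per-turn figure are hypothetical. The 200-turn, 100K-token shape mirrors the upper end of the trajectories the benchmark reports.

```python
def prefill_tokens(turn_lengths, reuse_kv_cache):
    """Total prompt tokens the server must process across a multi-turn run.

    turn_lengths[i] is the number of NEW tokens appended at turn i
    (tool output, user message, prior model reply). The context at turn i
    is everything appended up to and including turn i.
    """
    total = 0
    context = 0
    for new_tokens in turn_lengths:
        context += new_tokens
        # With reuse, only the new suffix is prefilled; without it, the
        # whole context is recomputed on every turn.
        total += new_tokens if reuse_kv_cache else context
    return total


# Hypothetical trajectory: 200 turns, 500 new tokens each (100K final context).
turns = [500] * 200
with_reuse = prefill_tokens(turns, reuse_kv_cache=True)
without_reuse = prefill_tokens(turns, reuse_kv_cache=False)
print(with_reuse)                  # 100000
print(without_reuse)               # 10050000
print(without_reuse / with_reuse)  # 100.5
```

Under these assumptions, recomputing the full context each turn costs about 100 times more prefill work than reusing cached state. Real systems fall somewhere between the two extremes, since caches have finite capacity and must be reachable from whichever GPU serves the next turn, which is where a rack-wide fabric helps.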

By the numbers: the efficiency gap across four platforms

On DeepSeek V4 Pro at the easiest SLO tier (20 tokens/s, 10s time-to-first-token), the spread is wide: GB300 NVL72 at ~61,354 Agents/MW, the single-node B300 at ~21,053, AMD's MI355X at ~3,551, and the Hopper-generation H200 at ~2,594 ^[1]^[2]. That places the GB300 NVL72 roughly 17x ahead of the MI355X and about 24x ahead of the H200 on this tier, with NVIDIA's own '20x more agents per megawatt than H200' headline sitting in the same range ^[2]. The single-node-to-rack-scale jump (B300 to GB300 NVL72) is itself nearly 3x, which isolates how much of the gain comes from rack-scale disaggregation rather than the chip alone ^[1]. The shape of the data is the story: this is not a tight race with marginal leads but an order-of-magnitude separation, one that is reported to widen as the service-level tier gets stricter.
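The multiples quoted above follow directly from the four reported figures. A short check, using only the Agents/MW values cited in this section (the dictionary and helper names are illustrative):

```python
# Reported Agents/MW on DeepSeek V4 Pro, easiest SLO tier (values from the text).
agents_per_mw = {
    "GB300 NVL72": 61_354,
    "B300 (single node)": 21_053,
    "MI355X": 3_551,
    "H200": 2_594,
}


def lead_over(leader, other, table=agents_per_mw):
    """How many times more agents per megawatt `leader` serves than `other`."""
    return table[leader] / table[other]


ratios = {
    name: round(lead_over("GB300 NVL72", name), 1)
    for name in agents_per_mw
    if name != "GB300 NVL72"
}
for name, ratio in ratios.items():
    print(f"GB300 NVL72 vs {name}: {ratio}x")
# GB300 NVL72 vs B300 (single node): 2.9x
# GB300 NVL72 vs MI355X: 17.3x
# GB300 NVL72 vs H200: 23.7x
```

The 2.9x figure is the rack-scale effect on the same chip generation; the remaining distance to the H200 is what the generational change contributes on top.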

Read the fine print: why the AMD numbers may understate reality

The launch results are a snapshot of immature software, and Artificial Analysis says so explicitly. It cautions that DeepSeek V4 Pro kernel optimizations and config design on AMD systems are in 'relative infancy,' and that it expects significant improvements in AMD performance in the near term ^[1]. That matters because the production optimizations the benchmark permits (speculative decoding and disaggregated serving among them) depend on mature, well-tuned kernels, and a two-week-old software stack will leave performance on the table that a hardware spec sheet would not predict. The honest reading is that the GB300 NVL72's lead on rack-scale architecture is real and structural, but the AMD-vs-NVIDIA single-chip gap is partly a software-maturity gap that should narrow as MI355X configurations are tuned ^[1]^[3]. Buyers comparing platforms on day-one numbers risk locking in a comparison that the next round of kernel work may meaningfully redraw.

Historical Context

2026-06-12

Artificial Analysis released the first AA-AgentPerf results on DeepSeek V4 Pro, showing NVIDIA Blackwell dominance in Agents per Megawatt.

2026-06-12

AA-AgentPerf extends Artificial Analysis's original hardware benchmark AA-SLT (single/medium-prompt throughput, speed, cost) to long-context multi-turn agentic workloads.

Power Map

Key Players

Subject

AA-AgentPerf benchmark and NVIDIA Blackwell efficiency

Artificial Analysis

Created and operates AA-AgentPerf; sets methodology, SLO tiers, and per-megawatt normalization. Its framing of power as the binding constraint shapes how buyers compare accelerators.

NVIDIA

Top performer on launch results with the GB300 NVL72 (Blackwell Ultra); uses the benchmark to validate rack-scale Blackwell and disaggregated prefill/decode against Hopper and AMD.

AMD

Competing accelerator vendor; the MI355X trailed Blackwell, though Artificial Analysis expects near-term AMD gains as DeepSeek V4 Pro kernel optimizations mature.

Inference providers (Together AI, DeepInfra, Baseten)

Cited as serving real agentic workloads on Blackwell, demonstrating the production relevance of the benchmark.

THE SIGNAL.

Analysts

The Crowd

"Today we're releasing the first results for AA-AgentPerf, our new agentic inference benchmark: initially covering DeepSeek V4 Pro across NVIDIA Blackwell, Hopper, and AMD. AA-AgentPerf is the first benchmark built for agentic inference. We use real, long-context agentic coding workloads."

@ArtificialAnlys

"Introducing AA-AgentPerf - the hardware benchmark for the agent era. Key details: ➤ Real agent workloads, not synthetic queries: we’ve captured real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens ➤ Production"

@ArtificialAnlys

"NVIDIA says its Blackwell Ultra is optimized for agentic AI, delivering 50× higher throughput per megawatt than H200s, 35× lower cost per million tokens, and 1.5× lower cost per token vs. GB200 NVL72."

@wccftech

Broadcast

Anthropic Models Suspended + NVIDIA Blackwell’s 20x Agent Leap

Adwaizer · AI News — DEEP DIVE · AGENTIC INFRA BENCHMARK · JUN 12