TECH

Claude Code Performance Degradation and Fixes

32+

Signals

Strategic Overview

01.
Anthropic's April 23 postmortem attributes recent Claude Code quality regressions to three independent harness bugs — not the underlying model — affecting Claude Code, the Claude Agent SDK, and Claude Cowork across Sonnet 4.6, Opus 4.6, and Opus 4.7.
02.
The three bugs were a March 4 reasoning-effort downgrade from 'high' to 'medium', a March 26 caching feature that wiped thinking every turn instead of after one hour idle, and an April 16 verbosity prompt capping inter-tool text at 25 words that cost roughly 3% on coding evals.
03.
Anthropic reverted all three changes (April 7, v2.1.101 on April 10, v2.1.116 on April 20) and reset usage limits for every subscriber on April 23 as remediation.
04.
Separately, Claude Opus 4.7's tightened Acceptable Use Policy classifiers are producing false-positive refusals on legitimate developer work, including Russian-language prompts, computational structural biology, cybersecurity lab materials, and a Hasbro Shrek toy-ad PDF.

How Three Unrelated Patches Stacked Into One Quality Cliff

The postmortem's central, counterintuitive finding is that Claude Code's six-week slump was not one bug but three — each shipped by a different team, each plausible in isolation, and each interacting with the others in ways that defeated Anthropic's normal attribution tools. On March 4, a latency-focused change lowered the default reasoning effort from 'high' to 'medium'; Anthropic later admitted 'this was the wrong tradeoff.' On March 26, a prompt-caching feature meant to clear thinking from sessions idle more than an hour instead cleared thinking every turn — so Claude, in Anthropic's own words, 'would continue executing, but increasingly without memory of why it had chosen to do what it was doing.' On April 16, an innocuous-looking system instruction — 'Length limits: keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.' — quietly taxed the model's working explanation space and dragged coding evals down ~3%.

Individually each change was defensible; together they produced compounding, overlapping symptoms — shallow reasoning, amnesia, clipped explanations — which is why Anthropic called this 'the most complex investigation we've had.' The lesson for every team shipping an agentic harness is that harness regressions do not behave like model regressions. They drift in, masquerade as model variance, and resist A/B rollback because the confounders overlap. Anthropic's eventual fixes were precise (revert April 7, v2.1.101 on April 10, v2.1.116 on April 20), but the six weeks between first symptom and last fix is the real story.

By The Numbers: The AMD Telemetry Audit That Turned 'Vibes' Into a Bug Report

The postmortem would not exist without Stella Laurenzo, AMD's Senior Director of AI, who on April 2 published a GitHub analysis of 6,852 Claude Code sessions, 234,760 tool calls, and 17,871 thinking blocks. Her numbers converted diffuse developer grievance into something Anthropic engineers could not wave away: files read before editing fell from 6.6 to 2.0, blind edits — edits with no prior read — surged from 6.2% to 33.7%, full-file rewrites rose from 4.9% to 11.1%, and stop-hook violations (outright task abandonment) went from zero to 173 in seventeen days. Thinking visibility collapsed from 100% on March 4 to under 1% by mid-month, and median visible thinking length dropped from roughly 2,200 characters to 600.

Before the audit, Claude Code lead Boris Cherny had publicly pushed back on nerfing claims on X. After it, the conversation shifted from 'is this real' to 'which harness path is causing this.' Laurenzo's framing was sharp — 'Claude cannot be trusted to perform complex engineering tasks' — and her causal story, that 'when thinking is shallow, the model defaults to the cheapest action available,' matched the symptom pattern almost exactly. It is a small case study in what it takes to force a vendor investigation in the agentic era: you need instrumented sessions at scale, not screenshots.

Trust Debt: Vindication Without Reconciliation

Community reception of the postmortem is split between vindication and residual suspicion. On Reddit, the top r/ClaudeCode thread on the postmortem clears thousands of upvotes with a 'we weren't imagining it' tone, while a parallel 'Opus 4.7 is legendarily bad' post — heavy on hallucination and gaslighting complaints — suggests the fixes haven't converted skeptics, and a widely circulated workaround guide instructs users to downgrade to Opus 4.6 via the CLI rather than trust 4.7. On developer YouTube, the dominant frame is trust crisis rather than closure: Theo's 'Claude Code is unusable now' pulled an outsized audience in April, and the Prompt Engineering channel treated the postmortem itself as a transparency watershed. On X, Gergely Orosz captured the mood by warning that Claude Code was 'burning dev goodwill,' while others read the same-day reset of usage limits as an apology reflex borrowed from OpenAI's playbook rather than a real concession.

What this sentiment shape reveals is that the postmortem solved the technical problem before it solved the trust problem. Developers spent six weeks being told their perception was wrong; a retroactive explanation, even a detailed one, does not fully close that loop. That is why the rate-limit reset landed ambiguously — generous on its face, but timed in a way that invited cynicism.

The Second Crisis Hiding Under the First: Opus 4.7's AUP Overreach

While the postmortem closes the book on the harness bugs, it quietly leaves a second, unresolved regression open: Claude Opus 4.7's new Acceptable Use Policy classifiers are producing false-positive refusals on legitimate developer work. The Register documents refusals on Russian-language prompts, computational structural biology work, cybersecurity lab materials, and — most absurdly — a Hasbro Shrek toy advertisement PDF, where 'the trigger was PDF syntax translating to CHARACTER OR FOR DONKEY UNDERNEATH.' The volume shift is itself telling: GitHub reports of AUP false positives went from two to three per month between July and September 2025 to more than thirty in April 2026 alone.

The developer response has been unusually pointed. Golden G. Richard III, director of the LSU Cyber Center, argued that 'for $200+ per month, basic help with editing tasks will not be rejected...the model refusing to proofread a lab containing simple crypto exercises is absurd.' Unlike the harness bugs, this regression lives inside the model's safety layer rather than the product scaffolding, which means there is no clean revert — Anthropic has to retune classifier thresholds against an adversarial baseline. It also means the paying customers most likely to hit refusals are precisely the technical audiences Claude Code is courting: security researchers, bioinformaticians, multilingual users. The postmortem's credibility bump will hold only as long as this parallel crisis stays contained.

What This Reveals About Agentic Harness Reliability

Simon Willison's gloss on the postmortem — that 'the kinds of bugs that affect harnesses are deeply complicated, even if you put aside the inherent non-deterministic nature' — is worth taking seriously as the agentic-systems industry scales. A model is a relatively fixed artifact; a harness is a living product surface of system prompts, caching rules, tool schemas, reasoning-effort dials, and verbosity shims, and every one of those knobs can silently change the behavior users experience without any change to the underlying weights. Anthropic's three bugs were, in effect, three different categories of harness failure — a capability dial (reasoning effort), a memory policy (caching clear window), and an output constraint (length limit) — which suggests the attack surface for regressions is much wider than 'we updated the model.'

The remediation commitments Anthropic pledged — stronger evals, tighter prompt controls, and better communication channels — are tactical, but the structural implication is that agentic vendors now need regression testing that treats harness changes as first-class shipping artifacts, with eval coverage that specifically catches 'model is fine, product is broken.' For buyers and enterprise builders, the takeaway is inverse: the quality of a coding agent is no longer defined by the model card alone. It is the product of model × harness × version, and those three factors drift independently.

Historical Context

2026-03-04

Default Claude Code reasoning effort lowered from 'high' to 'medium' to reduce UI latency, unintentionally hurting Sonnet 4.6 and Opus 4.6 intelligence on complex tasks.

2026-03-26

Ships a prompt-caching feature intended to clear thinking from sessions idle over one hour, which instead clears thinking every turn.

2026-04-02

Publishes a GitHub telemetry audit of 6,852 Claude Code sessions quantifying severe behavioral regressions.

2026-04-07

Reverts the March 4 reasoning-effort downgrade after user feedback prioritizing intelligence over latency.

2026-04-10

Ships v2.1.101 fixing the caching bug that had been wiping thinking on every turn.

2026-04-16

Adds a verbosity-limiting system prompt (≤25 words between tool calls, ≤100 words final) that ends up costing roughly 3% on coding evaluations.

2026-04-20

Ships v2.1.116 removing the verbosity prompt, closing the third regression.

2026-04-23

Publishes the postmortem, denies intentional slowdown, resets rate limits for every subscriber, and The Register simultaneously reports Opus 4.7's tightened AUP classifiers are producing false-positive refusals with 30+ GitHub reports filed in weeks.

Power Map

Key Players

Subject

Claude Code Performance Degradation and Fixes

Anthropic

Vendor of Claude Code, Opus 4.6/4.7, and Sonnet 4.6; authored the postmortem, shipped all three fixes, and reset subscriber rate limits as remediation.

Claude Code paying subscribers

Pro and Max plan developers who absorbed six weeks of degraded output across March–April 2026 and whose rate limits were reset as compensation.

Stella Laurenzo

Senior Director of AI at AMD whose April 2 GitHub telemetry audit of 6,852 sessions quantified the degradation and forced Anthropic to investigate rather than deflect.

Boris Cherny

Claude Code lead at Anthropic who publicly pushed back on nerfing accusations on X before the postmortem confirmed product-layer regressions were real.

Golden G. Richard III

Director of the LSU Cyber Center, a prominent voice criticizing Opus 4.7's overzealous AUP refusals on routine cybersecurity education material.

Simon Willison

Independent developer and analyst whose weblog amplified the postmortem to builders of agentic systems and framed the harness-bug lessons for the broader community.

THE SIGNAL.

Analysts

"Called Claude Code untrustworthy for complex engineering, arguing shallow thinking pushed the model to the cheapest available action — 'edit without reading, stop without finishing, dodge responsibility for failures, take the simplest fix rather than the correct one.'"

Stella Laurenzo

Senior Director of AI, AMD

"Frames the postmortem as essential reading for agentic-system builders, noting that 'the kinds of bugs that affect harnesses are deeply complicated, even if you put aside the inherent non-deterministic nature.'"

Simon Willison

Independent developer, simonwillison.net

"Denies intentional degradation — 'We never intentionally degrade our models, and we were able to immediately confirm that our API and inference layer were unaffected' — and describes the investigation as the company's most complex to date."

Anthropic engineering

Postmortem authors, Anthropic

"Calls Opus 4.7's AUP classifier absurd for overblocking paying customers: 'I expect that for $200+ per month, basic help with editing tasks will not be rejected...the model refusing to proofread a lab containing simple crypto exercises is absurd.'"

Golden G. Richard III

Director, LSU Cyber Center

The Crowd

"Anthropic really is burning more and more dev goodwill Claude Code is suddenly getting unusable for stuff you could use it before (as in a day before!) and the AI now refuses to so stuff that it doesn't think is strictly to do with software development. No transparency why ofc"

@@GergelyOrosz2500

"Never going to hit my usage limits what with Anthropic now imitating OpenAI in 'we apologize and reset all users' limits' as a first-line customer service response to issues"

@@KelseyTuoc161

"Anthropic just published a postmortem explaining exactly why Claude felt dumber for the past month"

@u/Direct-Attention85972700

"Opus 4.7 is legendarily bad. I cannot believe this."

@u/lemon07r1900

Broadcast

Claude Code is unusable now

Claude Code has a big problem

Claude Code Downgrade? Here's What Actually Happened