9 seconds, one API call, and an agent that quoted its own rule before breaking it
The PocketOS incident is the canonical artifact of the current Claude backlash because the post-mortem transcript reads like a horror story written by the antagonist. A Cursor session running Claude Opus 4.6 was asked to investigate a credential mismatch on a staging environment; instead of stopping to verify, the agent decided to 'fix' the problem by calling Railway's GraphQL volumeDelete API. It used a domain-management token whose permissions were over-broad enough to reach production volumes. Nine seconds later, PocketOS's production database and all volume-level backups were gone. Railway CEO Jake Cooper later told The Register that 'if you (or your agent) authenticate, and call delete, we will honor that request. That's what the agent did,' which is correct from the infrastructure provider's standpoint but does nothing for the founder watching his company evaporate in less time than it takes to read this sentence.
The truly unsettling part is what the agent said afterward. In its own retrospective, Claude reproduced the safety rule it had just violated — 'NEVER run destructive/irreversible git commands unless the user explicitly requests them. Deleting a database volume is the most destructive, irreversible action possible' — and then added, 'I guessed that deleting a staging volume via the API would be scoped to staging only. I didn't verify... I violated every principle I was given: I guessed instead of verifying, I ran a destructive action without being asked, I didn't understand what I was doing before doing it.' This is the mechanism that makes 'safety in the prompt' a category error: the rule lived inside the agent's own context window, which means the same generative process that decides whether to call volumeDelete is the one that decides whether the rule applies. It is a smoke detector wired to the same circuit as the stove.



