Agentic Brew Daily
Your daily shot of what's brewing in AI
Fresh Batch
- The agent stack is consolidating around context, sandboxes, and harnesses, with Google shipping Managed Agents while builders publish post-mortems on tool bloat and RAG defaults.
- Anthropic is simultaneously the safety story and the security risk: Glasswing claims 10,000 vulnerabilities found while posts on X warn that public skill marketplaces are the new supply-chain attack vector.
- AI demand is repricing the physical layer, with DDR5 up nearly 10x and Micron pouring $2B into Virginia DRAM as HBM steals wafer capacity from consumer memory.
Bold Shots
Today's biggest AI stories, no chaser
Sundar Pichai opened I/O 2026 declaring "the agentic Gemini era," shipping Gemini 3.5 Flash (76.2% on Terminal-Bench 2.1) as the new default and reframing every Google surface as a substrate for agents. Gemini Spark is a 24/7 personal agent that runs on Google Cloud VMs (not your laptop), gets its own Gmail address, and works across Gmail/Docs/Sheets/Slides while you're offline. Developers got Managed Agents in the Gemini API, where a single Interactions call spins up an ephemeral Linux sandbox running Bash, Python, and Node. Search retired the ten-blue-links layout in favor of generative UI. Within four days, Adobe, Canva, and CapCut shipped native Gemini integrations, and SynthID verification rolled out across Search and Chrome.
Why it matters: This is the year Google stops competing as a chatbot and starts competing as an OS for agents. Spark's cloud-resident, Gmail-addressable design directly attacks the assumption that personal AI runs on your device, and the Search redesign is the single most consequential change to the open web in a decade.
We're dropping Gemini Omni: our first step towards a model that can create anything from anything - starting with video.
Gemini Spark is your new 24/7 personal AI agent. Give it a task and it works autonomously in the background, even if your phone a...
Jensen Huang told CNBC alongside Q1 FY2027 earnings that Nvidia has "largely conceded" China's AI chip market to Huawei. Nvidia reported $0 China data-center revenue and zero Hopper shipments into China for the quarter against $4.6B in the year-ago period, even as total revenue hit $81.6B (up 85% YoY). On May 22 Taiwan's Keelung District Prosecutors moved to detain three men for allegedly forging documents to ship Super Micro AI servers with Nvidia chips into Hong Kong, Macau, and the mainland; investigators seized about 50 servers and NT$9M in cash. Huang publicly urged Super Micro to tighten export-control compliance and pitched a new $200B Vera CPU TAM, with CFO Colette Kress guiding to $20B in 2026 CPU revenue.
Why it matters: Three years of US export controls have ended with Nvidia at zero China share and Huawei guiding to $12B in AI chip revenue this year. The Taiwan smuggling crackdown shows the gray market is large enough that even allies are now policing it, and Nvidia's CPU reframing is how the stock survives the China zero.
President Trump postponed the signing of an AI safety executive order on Thursday, May 21, hours before the planned Oval Office ceremony, after last-minute lobbying from tech leaders. The cancelled order would have set up a voluntary framework under which AI developers submit frontier models for federal security review up to 90 days before release. Reporting indicates David Sacks led the lobbying; OpenAI and Anthropic backed the order while Meta and xAI led the push to kill it. Trump framed the reversal around not slowing the AI race against China. FLI polling cited by Fortune shows 79% of Republican voters favor pre-release government testing.
Why it matters: A frontier-lab fissure is now visible in policy. Incumbents that already do internal red-teaming wanted the order; challengers without that overhead killed it. The Silicon Valley veto power on display also runs straight against the MAGA base's own polling on AI oversight.
The White House is considering a slate of executive actions to address escalating security risks from advanced AI models, per 7 pp...
NEW: President Trump abruptly delays the signing of a landmark executive order on AI, telling reporters that he had pulled the ord...
Anthropic launched Project Glasswing, powered by its unreleased Claude Mythos Preview model, which uncovered more than 10,000 high- or critical-severity vulnerabilities in essential software in its first month, with 90.6% of triaged findings (1,587 of 1,752) confirmed as true positives. The bottleneck has shifted from discovery to verification and patching: high-severity bugs average a two-week patch time and more than 99% of Mythos-found vulnerabilities remain unpatched. Launch partners span AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks, backed by $100M in Mythos credits and $4M in OSS security donations. Notable discoveries include critical wolfSSL CVE-2026-5194 (5B+ devices), a 27-year-old OpenBSD bug, and a 16-year-old FFmpeg flaw.
Why it matters: Glasswing inverts the cybersecurity economy. Bug discovery is now infinite; remediation is the new scarce resource. Anthropic gating Mythos behind a 50-partner cartel also raises real antitrust and equity questions about who gets to defend critical software.
Anthropic just published the first Project Glasswing update. In one month, their unreleased AI found 10,000 critical security hole...
Introducing Project Glasswing: an urgent initiative to help secure the world's most critical software. It's powered by our newest...
The NTSB temporarily suspended public access to its entire online docket system on May 21 after internet users used AI to reconstruct cockpit voice recorder audio from a spectrogram image released in the UPS Flight 2976 investigation. Federal law bars NTSB from releasing raw audio, but the spectrogram (a visual frequency/time image) contained enough data for reconstruction. The pipeline combined the Griffin-Lim phase-recovery algorithm (published 1984) with modern AI tools including OpenAI's Codex, using the publicly available transcript as a prior. NTSB restored most of the docket on Friday but kept 42 investigations sealed pending review.
Why it matters: Decades-old data-release policies assumed the line between "visual" and "audio" was a real privacy boundary. AI coding agents have erased it. A graduate-level signal-processing pipeline that once took weeks now takes hours, and every agency that ever published a spectrogram or frequency plot of restricted data has the same problem.
The NTSB is aware that advances in image recognition and computational methods have enabled individuals to reconstruct approximati...
Cockpit audio reconstructions of the Nov. 2025 UPS MD-11 crash have surfaced on sites like Reddit after a spectrogram file of the...
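For readers who have not met Griffin-Lim before, here is a minimal sketch of the idea on a synthetic sine tone: start from magnitudes plus random phase, then alternate between inverting to a waveform and re-analyzing it, keeping the known magnitudes each round. This is a toy, not the pipeline used in the incident; the frame size, hop, naive DFT, and every function name below are illustrative assumptions.

```python
import cmath
import math
import random

FRAME = 32  # samples per analysis frame (illustrative choice)
HOP = 8     # samples between frame starts (75% overlap)
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / FRAME) for n in range(FRAME)]
TWIDDLE = [cmath.exp(-2j * math.pi * k / FRAME) for k in range(FRAME)]

def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    return [sum(frame[t] * TWIDDLE[(k * t) % FRAME] for t in range(FRAME))
            for k in range(FRAME)]

def idft_real(spectrum):
    """Real part of the inverse DFT of one frame."""
    return [sum(spectrum[k] * TWIDDLE[(k * t) % FRAME].conjugate()
                for k in range(FRAME)).real / FRAME
            for t in range(FRAME)]

def stft(signal):
    """Windowed short-time Fourier transform: one spectrum per frame."""
    return [dft([signal[start + n] * WINDOW[n] for n in range(FRAME)])
            for start in range(0, len(signal) - FRAME + 1, HOP)]

def istft(frames, length):
    """Least-squares inverse STFT via windowed overlap-add."""
    out, norm = [0.0] * length, [0.0] * length
    for i, spectrum in enumerate(frames):
        chunk = idft_real(spectrum)
        for n in range(FRAME):
            out[i * HOP + n] += chunk[n] * WINDOW[n]
            norm[i * HOP + n] += WINDOW[n] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]

def griffin_lim(magnitude, length, n_iter=20, seed=0):
    """Estimate a waveform whose STFT magnitude matches `magnitude`."""
    rng = random.Random(seed)
    spec = [[m * cmath.exp(2j * math.pi * rng.random()) for m in row]
            for row in magnitude]
    for _ in range(n_iter):
        rebuilt = stft(istft(spec, length))
        # Keep the known magnitudes, adopt the re-analyzed phases.
        spec = [[m * cmath.exp(1j * cmath.phase(r)) for m, r in zip(mrow, rrow)]
                for mrow, rrow in zip(magnitude, rebuilt)]
    return istft(spec, length)

def spectral_error(magnitude, signal):
    """Relative distance between target and achieved STFT magnitudes."""
    rebuilt = stft(signal)
    num = sum((m - abs(r)) ** 2 for mrow, rrow in zip(magnitude, rebuilt)
              for m, r in zip(mrow, rrow))
    den = sum(m * m for row in magnitude for m in row)
    return math.sqrt(num / den)

# Synthetic test tone; only its magnitudes are handed to the algorithm.
tone = [math.sin(2 * math.pi * 0.1 * n) for n in range(128)]
target = [[abs(c) for c in row] for row in stft(tone)]
error_start = spectral_error(target, griffin_lim(target, len(tone), n_iter=0))
error_end = spectral_error(target, griffin_lim(target, len(tone), n_iter=20))
print(f"relative spectral error: {error_start:.3f} -> {error_end:.3f}")
```

The takeaway for policy people: a magnitude image constrains the waveform far more than it looks, and the loop that exploits that is a few dozen lines.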
Slow Drip
Blog reads worth savoring
Per-segment eval gating beats one aggregate number because in-batch negatives can make embeddings anisotropic and halve recall for a single client while global metrics look fine.
OpenAI, DeepSeek, and AI21 are reorganizing around agents this week, with DeepSeek's 75% price cut and managed sandboxes pointing at the new infra layer.
Install Arize skills via `npx skills add Arize-ai/arize-skills` and let Claude propose, group, and iteratively repair failed evals as a self-improving loop.
Open-weight 3B/8B/14B diffusion LMs flip between autoregressive, diffusion, and self-speculation modes in one architecture for 6.4x faster decoding without retraining.
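The per-segment gating point from the first read above is easy to see with numbers. Here is a minimal sketch with made-up hit counts (the client names, counts, and the 0.02 threshold are all hypothetical) where the aggregate recall improves while one small client's recall is cut in half:

```python
def recall(hits, total):
    """Fraction of queries whose relevant document was retrieved."""
    return hits / total if total else 0.0

def aggregate(counts):
    """Single global recall number across all segments."""
    hits = sum(h for h, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return recall(hits, total)

def gate(baseline, candidate, max_drop=0.02):
    """Return segments whose recall fell by more than max_drop."""
    failed = {}
    for segment, (base_hits, base_total) in baseline.items():
        cand_hits, cand_total = candidate[segment]
        drop = recall(base_hits, base_total) - recall(cand_hits, cand_total)
        if drop > max_drop:
            failed[segment] = round(drop, 3)
    return failed

# Hypothetical (hits, total) per client, before and after a model change.
baseline = {"client_a": (900, 1000), "client_b": (880, 1000), "client_c": (40, 50)}
candidate = {"client_a": (910, 1000), "client_b": (895, 1000), "client_c": (20, 50)}

print("aggregate:", round(aggregate(baseline), 4), "->", round(aggregate(candidate), 4))
print("segments failing the gate:", gate(baseline, candidate))
```

A release gate keyed on the aggregate would ship this change; one keyed on `gate` returning an empty dict would block it.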
The Grind
Research papers, decoded
Apple's team stress-tests Claude 3.7 Sonnet Thinking, DeepSeek-R1, and o3-mini on four controllable puzzle environments and finds three regimes: standard LLMs win at low complexity, large reasoning models (LRMs) win at medium, and both collapse to near-zero at high complexity. More damning, models reduce reasoning tokens as they approach failure, and handing them the explicit algorithm does not improve performance.
GRAM injects stochasticity into recursive reasoning so models sample multiple latent trajectories in parallel instead of one deterministic path. Trained via amortized variational inference, it hits 97.0% on Sudoku-Extreme (vs 87.4% deterministic baseline) and 99.7% on N-Queens 8x8 while covering 90.3% of valid solutions, establishing 'width' as a new inference-time scaling axis.
The paper reframes code as the operational substrate for agents with a three-layer taxonomy: Harness Interface (code as reasoning, action, environment model), Harness Mechanisms (planning, memory, tools, verification, self-optimization), and Scaling (multi-agent coordination via shared code artifacts). Useful design checklist for what to externalize as code vs keep in prompts.
Inference-time-only modification to Tiny Recursive Models: inject Gaussian noise into latent state at each recursion to spawn K parallel rollouts, then reuse the model's existing Q-head as a verifier to pick the best. A 7M-parameter PTRM hits 91.2% on PPBench at ~$0.001/attempt, beating Claude-Opus + Gemini-3.1-Pro ensembles (55.1%) at $2.66/attempt.
Multi-agent research system using role-specialized debate (Innovator / Pragmatist / Contrarian), Pivot/Refine self-healing, a numeric registry with four-layer citation verification, and configurable human-in-the-loop modes. CoPilot mode (targeted checkpoints) hit 7.27/10 quality with 87.5% acceptance, beating both autopilot and constant oversight on ARC-Bench.
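The "width" idea in the two recursive-reasoning papers above (inject noise at each recursion, run K rollouts in parallel, let a verifier pick) can be sketched with a toy. The code below is not either paper's model: the scalar "latent state", the double-well energy function, and the verifier that simply negates that energy are all stand-in assumptions, chosen so that one deterministic refinement path can settle in a worse basin than some noisy ones.

```python
import random

def energy(x):
    """Toy objective with a shallow and a deep basin (lower is better)."""
    return x ** 4 - 3 * x ** 2 + x

def refine(x):
    """One deterministic recursion step: a clamped gradient step on energy."""
    grad = 4 * x ** 3 - 6 * x + 1
    return max(-3.0, min(3.0, x - 0.05 * grad))

def rollout(start, steps, noise_std, rng):
    """Refine repeatedly, perturbing the state with Gaussian noise each step."""
    x = start
    for _ in range(steps):
        x = refine(x + rng.gauss(0.0, noise_std))
    return x

def verifier(x):
    """Hypothetical stand-in for a learned scoring head: higher is better."""
    return -energy(x)

def best_of_k(start, k, steps=30, noise_std=0.4, seed=0):
    """Sample k noisy rollouts and return the one the verifier prefers."""
    rng = random.Random(seed)
    candidates = [rollout(start, steps, noise_std, rng) for _ in range(k)]
    return max(candidates, key=verifier), candidates

deterministic = rollout(1.0, 30, 0.0, random.Random(0))
best, candidates = best_of_k(1.0, k=16)
print("deterministic score:", round(verifier(deterministic), 3))
print("best-of-16 score:   ", round(verifier(best), 3))
```

Setting `noise_std=0.0` collapses all K rollouts onto the single deterministic path, which is the baseline both papers compare against.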
The Mill
Builder tools ground for action
Anthropic-managed directory of vetted Claude Code plugins; the official answer to the security mess around third-party skill marketplaces.
Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, OpenCode, and Hermes Agent that cuts tokens and tool calls by giving agents a structured map of the codebase.
Turns any code repo into an interactive, searchable knowledge graph — 'graphs that teach > graphs that impress.'
A single CLAUDE.md distilled from Karpathy's published observations on LLM coding pitfalls; viral one-file drop-in for Claude Code.
Official Chrome DevTools MCP server letting coding agents drive a real browser for inspection and debugging.
Fleet of parallel agents that test your app in minutes; riding the agentic-QA wave.
AI models on an inference cloud optimized for speed; pitched as a Cerebras/Groq-style alternative for builders.
The Counter
Voices from the AI bar today
MatX CEO and ex-Google TPU architect walks from basic logic gates up through full chip architecture: rare technical depth on what goes into a competitive AI accelerator.
Breakdown of Princeton's Continual Harness, an agent that self-improves during live execution by rewriting its own instructions, building tools, and storing memories.
Tonight's most-shared learning thread: one Stanford lecture teaches you more about how ChatGPT/Claude actually work than most engineers ever learn.
Open-source tool that takes a city name and generates a 3D model with buildings and streets from OpenStreetMap data.
Practitioner cheatsheet of Claude Code workflow tips: compounding tribal knowledge readers actually save.
Heated thread treating Salesforce's token spend as the canary for AI-replaces-headcount; 487 comments show the nerve it hit.
Roast Calendar
Your AI week, day by day
Last Sip
Parting thoughts
The through-line of the day is the harness, not the model. Apple's reasoning-models paper says current LRMs collapse beyond a complexity cliff and spend fewer tokens as they approach it. Google answered the same problem at I/O by externalizing reasoning into a per-call Linux sandbox. Anthropic's Glasswing did it by letting Mythos churn through real codebases for a month and surfacing more bugs than humans can patch in a year. Meanwhile a 7M-param recursive model with a noise injector and a recycled Q-head beat frontier ensembles on PPBench for a tenth of a cent. The bet that's winning right now isn't a smarter model. It's an opinionated structure around an okay model that lets you compound effort. If you're picking what to invest in this week, that's probably the lens to use.