May 23, 2026

Agentic Brew Daily

Your daily shot of what's brewing in AI

Fresh Batch

Distilled trend
  • Sacks, Musk, and Zuckerberg killed Trump's AI executive order, and Newsom signed California's worker-displacement order the same day Meta cut 8,000 jobs.
  • Anthropic now runs Claude across AWS Trainium, Google TPUs, Nvidia GPUs, and SpaceX Colossus, with the Colossus deal alone costing $1.25B per month, spreading its compute across four stacks at once.
  • Code-as-agent-harness became a product category in one week, as OpenAI shipped Goal Mode GA, Cursor opened Composer 2.5 as an SDK, and Anthropic made memory an API primitive.

Bold Shots

Today's biggest AI stories, no chaser

A draft executive order asking frontier labs to voluntarily submit advanced models to federal national-security agencies for 14-90 days of pre-release review was on Trump's desk Thursday morning. By the time the news cycle ended, David Sacks, Elon Musk, and Mark Zuckerberg had each phoned the White House to argue the framework would become a de facto licensing regime, and Trump tabled the signing. The reversal leaves OpenAI, Anthropic, Google, and Microsoft with no federal floor for frontier-model safety review, even though Treasury, NSA, and CISA have concluded these models can find vulnerabilities in production systems. The political split is not Republican vs Democrat: 79% of GOP voters want pre-release testing, and Steve Bannon and 60+ MAGA co-signers put their names to a Humans First letter demanding mandatory testing.

Why it matters: A morning of phone calls killed a year of interagency work and ceded the safety-review agenda to state legislatures and to Beijing, which is advancing comprehensive AI legislation in parallel. The donor faction of MAGA beat the base faction on the call that mattered.

Anthropic will pay SpaceX roughly $45B over three years, or $1.25B per month through May 2029, for 300+ MW of compute at xAI's Colossus data centers in Memphis. The deal covers 222,000+ Nvidia GPUs across H100, H200, and GB200, stacks on top of Anthropic's $100B+ AWS Trainium commitment and its multi-gigawatt Google TPU deal, and was disclosed inside SpaceX's S-1 IPO filing targeting a $1.75T valuation. Either party can terminate on 90 days' notice, and Musk personally retained a discretionary clause to reclaim compute if Anthropic's AI is judged to harm humanity.

Why it matters: This is the public price tag for frontier-scale infrastructure: $15B per year, single customer, single supplier, paid to a direct competitor. It validates the SpaceX/xAI IPO thesis and makes Anthropic's reliance on a Musk-controlled supplier with a unilateral kill switch a board-level question.

On May 20, OpenAI announced that an internal general-purpose reasoning model produced a disproof of the 1946 Erdős planar unit-distance conjecture, constructing an infinite family of n-point configurations with n^(1+delta) unit-distance pairs, which polynomially beats the long-assumed near-linear bound. Princeton's Will Sawin sharpened delta to at least 0.014 the same day. Fields Medalist Tim Gowers and Oxford's Thomas Bloom co-authored the 19-page companion paper, with Gowers saying that if a human had submitted this to the Annals of Mathematics he would have no doubt it was a milestone.

Why it matters: This is the first time a general-purpose AI has autonomously produced a frontier mathematical result that would have cleared peer review on its own. The methodology bridges plane geometry to algebraic number theory via class field towers, which is a new constructive branch, not a heuristic. Daniel Litt called it the unique interesting result produced autonomously by AI so far.

Pichai opened I/O declaring the agentic era and reframing Search as an agent manager, the most consequential structural shift in Google's core product in two decades. Gemini 3.5 Flash is now the default model in the Gemini app and AI Mode in Search worldwide, with AI Mode exceeding 1 billion monthly users. Demis Hassabis unveiled Gemini for Science: AI Co-Scientist plus AlphaEvolve plus Science Skills connecting agentic platforms to 30+ life-science databases, with BASF, Klarna, Daiichi Sankyo, Bayer Crop Science, Stanford Medicine, and U.S. National Labs as partners. Antigravity 2.0 shipped as a desktop app, CLI, and SDK orchestrating parallel autonomous coding agents at 12x the public API speed.

Why it matters: The full-stack play (8th-gen TPUs, Gemini 3.5, Search/Android/Workspace distribution, ad inventory) lets Google monetize the agentic transition in ways pure-play labs can't. The unstated losers are Booking, Expedia, DoorDash, Zillow, and Instacart, marketplaces that get bypassed when agentic Search completes the transaction in the SERP.

The Commerce Department announced letters of intent with nine quantum companies on May 21, totaling $2.013B in CHIPS-and-Science-Act funding, with the federal government taking a minority equity stake in each recipient. IBM gets $1B (plus a $1B match) to launch Anderon, America's first pure-play 300mm quantum wafer foundry in Albany. GlobalFoundries gets $375M. Atom Computing, D-Wave, Infleqtion, PsiQuantum, Quantinuum, and Rigetti each receive up to $100M, with Diraq getting $38M. D-Wave jumped 33%, Rigetti 31%, and IBM 12% the same day.

Why it matters: The government just took equity in nine quantum companies and underwrote two new foundries, modeled on the Intel CHIPS deal. Jefferies reads it as a direct response to China's state-backed quantum push. The multi-modality bet hedges across superconducting, trapped-ion, neutral-atom, silicon-spin, photonic, and topological hardware, which means no winner has been picked yet.

Slow Drip

Blog reads worth savoring

Analysis · Semianalysis Substack
EDA Market Primer - Market Dynamics, Cadence, Synopsys, Siemens, China EDA Rise

Maps the $18B EDA market with hard numbers: token licensing yields ~20% revenue uplift on flat headcount, foundry-mandated tool flows lock in 95%+ retention, and China's share climbs as Synopsys' China revenue slips from 16% to 12%.

Analysis · Aiweekender Substack
How Coding Agents Actually Work Under the Hood (and Why They Go Wrong)

Names ten concrete failure modes in Cursor and Claude Code (doom loops from stale observations, plan-mode that still writes, tool outputs eating 70-80% of context) with a four-layer architecture model to debug them.

Research · Alibaba Cloud Blog
SGLang Hierarchical Sparse Attention

Stores the full KV cache on CPU and keeps only a Top-k LRU buffer on GPU, cutting per-request GPU memory from 8GB to 200MB at 128K context and 5x-ing batch throughput.
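
A rough way to picture the two tiers, as a toy sketch rather than SGLang's actual implementation: the class name, the block granularity, and the plain dict standing in for host memory are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache (hypothetical, not SGLang's API).

    Every block lives in the CPU-side store; only the most recently
    used blocks stay in the small GPU-side LRU buffer.
    """

    def __init__(self, gpu_capacity: int):
        if gpu_capacity < 1:
            raise ValueError("gpu_capacity must be at least 1")
        self.gpu_capacity = gpu_capacity
        self.cpu_store = {}              # full KV cache (stand-in for host memory)
        self.gpu_buffer = OrderedDict()  # LRU buffer (stand-in for device memory)
        self.transfers = 0               # count of simulated CPU -> GPU copies

    def write(self, block_id, kv):
        # New KV blocks are always persisted in the full CPU-side store.
        self.cpu_store[block_id] = kv

    def fetch(self, block_ids):
        """Return KV data for the selected (e.g. top-k) blocks."""
        out = []
        for block_id in block_ids:
            if block_id in self.gpu_buffer:
                # Hit: mark the block as most recently used.
                self.gpu_buffer.move_to_end(block_id)
            else:
                # Miss: copy from the CPU store, then evict the oldest block if full.
                self.transfers += 1
                self.gpu_buffer[block_id] = self.cpu_store[block_id]
                if len(self.gpu_buffer) > self.gpu_capacity:
                    self.gpu_buffer.popitem(last=False)
            out.append(self.gpu_buffer[block_id])
        return out


cache = TieredKVCache(gpu_capacity=2)
for i in range(4):
    cache.write(i, f"kv{i}")
cache.fetch([0, 1])  # two misses, both copied in
cache.fetch([1, 2])  # block 1 hits, block 2 misses and evicts block 0
```

The point of the design is that GPU memory scales with the buffer size, not the context length, at the cost of a copy whenever attention selects a block that is not resident.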

News · Latent Space
New AI Infra unicorns: Exa, Modal, TurboPuffer

Three fresh infra unicorns in one day: Exa ($250M at $2.2B), Modal ($355M at $4.7B), and TurboPuffer hitting $100M ARR profitably 19 months after its first $1M while raising under $1M.

The Grind

Research papers, decoded

Reasoning Models · 8,626 upvotes · arxiv · X
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Apple researchers stress-tested frontier thinking models (Claude 3.7 Sonnet Thinking, DeepSeek-R1, o3-mini) inside controllable puzzles where they could dial complexity step by step. They found three sharp regimes: on easy problems, vanilla LLMs beat reasoning models for the same compute; on medium problems, reasoning models pull ahead; on hard problems, both collapse to near-zero accuracy. Counterintuitively, the reasoning models cut their thinking tokens as problems get harder, even with budget remaining. Don't pay the thinking tax for low-complexity tasks, and set a hard fallback policy past your domain's complexity ceiling.
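
One way to turn that takeaway into a policy, sketched under loud assumptions: the complexity score, both thresholds, and the route names are hypothetical and would have to be calibrated on your own tasks. None of them come from the paper.

```python
from enum import Enum


class Route(Enum):
    STANDARD = "standard_llm"
    REASONING = "reasoning_model"
    FALLBACK = "fallback"


def choose_route(complexity: float, easy_max: float = 3.0, ceiling: float = 8.0) -> Route:
    """Map a task-complexity score onto the three regimes.

    `easy_max` and `ceiling` are placeholder thresholds (assumptions),
    to be measured per domain rather than copied.
    """
    if complexity < 0:
        raise ValueError("complexity must be non-negative")
    if complexity <= easy_max:
        return Route.STANDARD   # easy: skip the thinking tax
    if complexity <= ceiling:
        return Route.REASONING  # medium: reasoning models pull ahead
    return Route.FALLBACK       # hard: both collapse, so escalate or decompose


print(choose_route(2).value)
print(choose_route(5).value)
print(choose_route(12).value)
```

What the fallback does (human review, task decomposition, a solver) is a product decision; the sketch only marks where that branch belongs.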

Recursive Reasoning Architectures · 27 upvotes · alphaxiv
Probabilistic Tiny Recursive Model (PTRM)

Inference-time-only trick for Tiny Recursive Models: inject Gaussian noise at each recursion step to run K parallel trajectories, then pick the winner using the Q-head that TRM already trained but normally throws away. No retraining, no task-specific augmentation. Sudoku-Extreme jumps from 87.3% to 98.75%, Pencil Puzzle Bench from 62.6% to 91.2%, beating an ensemble of seven frontier LLMs at roughly $0.001 per attempt with a 7M-parameter model. Width scaling via parallel noisy rollouts and the internal verifier head is a free accuracy lever before retraining.
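
The sample-then-select loop can be sketched in a few lines. This is a toy under stated assumptions, not the paper's code: `toy_step` and `toy_q_head` are hypothetical stand-ins for the trained recursion step and Q-head, noise is added after each step, and the K trajectories run sequentially instead of in parallel.

```python
import random
from typing import Callable, List, Tuple


def noisy_rollouts(
    init_state: List[float],
    step: Callable[[List[float]], List[float]],
    q_head: Callable[[List[float]], float],
    k: int = 8,
    depth: int = 6,
    sigma: float = 0.1,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Run k noisy trajectories and keep the one the verifier head scores highest."""
    rng = random.Random(seed)
    best_state, best_score = None, float("-inf")
    for _ in range(k):
        state = list(init_state)
        for _ in range(depth):
            state = step(state)
            # Inject Gaussian noise at every recursion step.
            state = [x + rng.gauss(0.0, sigma) for x in state]
        score = q_head(state)
        if score > best_score:
            best_state, best_score = state, score
    return best_state, best_score


def toy_step(state: List[float]) -> List[float]:
    # Hypothetical recursion step: contract toward 1.0.
    return [0.5 * x + 0.5 for x in state]


def toy_q_head(state: List[float]) -> float:
    # Hypothetical verifier: higher is better, best at all-ones.
    return -sum((x - 1.0) ** 2 for x in state)


_, single = noisy_rollouts([0.0, 0.0], toy_step, toy_q_head, k=1)
_, best_of_8 = noisy_rollouts([0.0, 0.0], toy_step, toy_q_head, k=8)
print(best_of_8 >= single)
```

With a shared seed the first trajectory is identical in both calls, so the best of eight can never score below the single rollout. That is the sense in which width is a free lever: it only helps if the scoring head is informative.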

Recursive Reasoning Architectures · 104 upvotes · alphaxiv
Generative Recursive Reasoning Models (GRAM)

Converts deterministic Recursive Reasoning Models into probabilistic ones via amortized variational inference, modeling reasoning as a stochastic latent trajectory so the model can pursue multiple hypotheses in parallel and scale inference-time compute through depth and trajectory sampling. 97.0% on Sudoku-Extreme vs TRM's 87.4%, 99.7% on 8x8 N-Queens with 90.3% solution coverage, plus an unconditional-generation mode (99.05% valid Sudoku boards from empty inputs) that deterministic baselines can't do. The trained-from-scratch counterpart to PTRM.

Agent Architectures · 56 upvotes · alphaxiv
Code as Agent Harness

A survey that reframes code in LLM agents from the thing the agent writes to the operational substrate the agent runs on, covering how code mediates reasoning (Program-of-Thoughts, PAL), acting (Code-as-Policies, Voyager), environment modeling (SWE-bench, WorldCoder), and multi-agent coordination (ChatDev, MetaGPT). Organizes the field into three layers (interface, mechanisms, scaling) and names the missing engineering discipline: harness engineering. If you're building agents, the bottleneck is the harness, not the LLM.
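
To make "code mediates reasoning" concrete, here is a minimal sketch in the Program-of-Thoughts style: the model's output is a small program, and the harness, not the model, computes the answer. The function name and the `answer` variable convention are assumptions for illustration, and stripping builtins from `exec` is not a real sandbox.

```python
from typing import Optional


def run_model_program(program: str, tools: Optional[dict] = None) -> object:
    """Execute model-emitted code and return whatever it binds to `answer`.

    Hypothetical harness: production systems isolate execution in a
    separate process or container. Emptying builtins, as done here,
    is NOT a security boundary.
    """
    namespace = {"__builtins__": {}}
    namespace.update(tools or {})  # explicitly whitelisted helpers only
    exec(program, namespace)
    return namespace.get("answer")


# Stand-in for model output: the arithmetic is delegated to the interpreter.
emitted = "monthly = 1.25\nanswer = monthly * 12"
print(run_model_program(emitted))
```

Everything interesting about harness engineering sits around this call: what tools go into the namespace, how errors are fed back to the model, and how execution is isolated.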

Model Development Tooling · 3 upvotes · huggingface
Forecasting Downstream Performance of LLMs With Proxy Metrics

Predicts how a candidate LLM will do on downstream tasks without running benchmarks: aggregate token-level statistics (entropy, top-k accuracy, expert-token rank) from the model's next-token distribution over expert-written solutions. Spearman rho 0.81 vs 0.36 for cross-entropy when ranking 18 heterogeneous models; reliable ranking of 25 pretraining corpora at ~10,000x less compute than direct evaluation. If you're picking between checkpoints, base models, or pretraining mixes, score a handful of expert solutions and aggregate token-level signals.
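
A minimal sketch of the per-token signals named above, computed from a next-token distribution at one position of an expert solution. The function names and the plain mean used for aggregation are assumptions; the summary does not say how the paper combines the statistics.

```python
import math
from typing import Dict, List, Sequence


def token_stats(probs: Sequence[float], expert_index: int, k: int = 3) -> Dict[str, float]:
    """Entropy, top-k hit, and expert-token rank for one position.

    `probs` is the model's next-token distribution; `expert_index`
    is the token the expert-written solution actually used.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank 1 means the expert token is the model's most likely token.
    rank = 1 + sum(1 for p in probs if p > probs[expert_index])
    return {
        "entropy": entropy,
        "top_k_hit": 1.0 if rank <= k else 0.0,
        "expert_rank": float(rank),
    }


def aggregate(steps: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each statistic over all positions (simple mean, an assumption)."""
    n = len(steps)
    return {key: sum(s[key] for s in steps) / n for key in steps[0]}


positions = [
    token_stats([0.7, 0.2, 0.1], expert_index=0, k=1),
    token_stats([0.7, 0.2, 0.1], expert_index=1, k=1),
]
print(aggregate(positions))
```

Scoring a candidate then means running it once over a handful of expert solutions and comparing the aggregated numbers across checkpoints, with no generation or benchmark harness involved.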

The Mill

Builder tools ground for action

16K stars, +3.7K today

Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode. Fewer tokens, fewer tool calls, 100% local. The drop-in knowledge layer in front of your coding agent.

TypeScript

24K stars, +2.6K today

Anthropic's official, managed directory of high-quality Claude Code plugins. Brand-new repo, instantly the canonical install source for Claude Code extensions.

Python

18K stars, +1.4K today

Turns any codebase into an interactive knowledge graph you can explore, search, and ask questions about. Works with Claude Code, Codex, Cursor, Copilot, and Gemini CLI.

TypeScript

11K stars, +988 today

Learn it. Build it. Ship it for others. A free, from-first-principles AI engineering curriculum that keeps climbing as bootcamp alternatives stay popular.

Python

448 votes · Product Hunt

Run one-person companies entirely with AI agents. Astra (an AI CEO) manages 10+ pre-built agents (CMO, CTO, etc.) and orchestrates Claude Code and Hermes under the hood. Hand it a KPI like 10x traffic this month and it plans, assigns, and reports back.

107 comments

307 votes · Product Hunt

Self-updating knowledge bases. Pre-built automations that keep docs, changelogs, and translations current automatically whenever the product changes.

39 comments

275 votes · Product Hunt

Vibe-code apps with the safety net of a no-code editor. Prompt AI to generate an app, then refine screens, workflows, and DB in a visible no-code editor. No more black box.

85 comments

197 votes · Product Hunt

Product demo videos, recorded by your AI agent. Drives your web app over MCP and returns a polished demo video plus GIF with zooms, cursor motion, and intro animation.

34 comments

The Counter

Voices from the AI bar today

5.1K views

How Man Group put AI-generated trading signals into live production under a skills framework plus core data layer, with compliance oversight, and scaled it across 750+ developers. A real template for deploying compliant agents in regulated industries.

insight 10

2.2K views

Benchmarks Gemini 3.5 Flash on the CARE benchmark (~75% planning, 46% intent recovery) and stress-tests Antigravity 2.0's agent-first, no-IDE workflow. Honest take on what works and what's still painful.

insight 9

55K engagements

Not a demo. Not a benchmark. The future already runs locally on dead hardware from 1999. The same topic also surfaces a $1,472 1B-param model matching 7x peers and a 66M-param TTS model beating ElevenLabs on a Raspberry Pi.

topic engagement 57,478

48K engagements

The June release rumor mill, spotlighting Musk's Macrohard, a purely-AI software company under xAI positioned against Microsoft.

topic engagement 48,582

1.8K upvotes · 125 comments

A senior engineer lays out a vibe-coding playbook: plan mode, iterative validation, version control, and forced test generation. The rules that make hands-off AI builds actually ship.

r/ClaudeAI

1.8K upvotes · 145 comments

Field-tested Claude tricks across Projects, Custom Styles, and subagents. A tactical cheat sheet for power users that hit the front page this morning.

r/ClaudeAI

Last Sip

Parting thoughts

A single morning of phone calls killed a year of federal interagency work, a general-purpose reasoning model produced a disproof that cleared a Fields Medalist's smell test, and a wafer foundry is being built in Albany on a government equity check. The throughline is that the people who decide what AI does next now sit in three rooms: a White House where one call moves policy, a Mountain View stage where Search becomes an agent manager, and a Mila lab where 7M-parameter models beat ensembles of frontier LLMs for a tenth of a cent per attempt. Pick which room you're building for, and pick on purpose.