Agentic Brew Daily
Your daily shot of what's brewing in AI
Fresh Batch
- Sacks, Musk, and Zuckerberg killed Trump's AI executive order, and Newsom signed California's worker-displacement order the same day Meta cut 8,000 jobs.
- Anthropic now runs Claude across AWS Trainium, Google TPUs, and Nvidia GPUs, with the newest capacity coming from SpaceX's Colossus sites at $1.25B per month.
- Code-as-agent-harness became a product category in one week, as OpenAI shipped Goal Mode GA, Cursor opened Composer 2.5 as an SDK, and Anthropic made memory an API primitive.
Bold Shots
Today's biggest AI stories, no chaser
A draft executive order asking frontier labs to voluntarily submit advanced models to federal national-security agencies for 14-90 days of pre-release review was on Trump's desk Thursday morning. By the time the news cycle ended, David Sacks, Elon Musk, and Mark Zuckerberg had each phoned the White House to argue the framework would become a de facto licensing regime, and Trump tabled the signing. The reversal leaves OpenAI, Anthropic, Google, and Microsoft with no federal floor for frontier-model safety review, even though Treasury, NSA, and CISA have concluded these models can find production-system vulnerabilities. The political split is not Republican vs. Democrat: 79% of GOP voters want pre-release testing, and Steve Bannon plus 60+ MAGA co-signers put their names to a Humans First letter demanding mandatory testing.
Why it matters: A morning of phone calls killed a year of interagency work and ceded the safety-review agenda to state legislatures and to Beijing, which is advancing comprehensive AI legislation in parallel. The donor faction of MAGA beat the base faction on the call that mattered.
Anthropic will pay SpaceX roughly $45B over three years, or $1.25B per month through May 2029, for 300+ MW of compute at xAI's Colossus data centers in Memphis. The deal covers 222,000+ Nvidia GPUs across H100, H200, and GB200. It stacks on top of Anthropic's $100B+ AWS Trainium commitment and its multi-gigawatt Google TPU deal, and it was disclosed inside SpaceX's S-1 IPO filing targeting a $1.75T valuation. Either party can terminate on 90 days' notice, and Musk personally retained a discretionary clause to reclaim compute if Anthropic's AI is judged to harm humanity.
Why it matters: This is the public price tag for frontier-scale infrastructure: $15B per year, single customer, single supplier, paid to a direct competitor. It validates the SpaceX/xAI IPO thesis and makes Anthropic's reliance on a Musk-controlled supplier with a unilateral kill switch a board-level question.
On May 20, OpenAI announced that an internal general-purpose reasoning model produced a disproof of the 1946 Erdős planar unit-distance conjecture, constructing an infinite family of n-point configurations with n^(1+delta) unit-distance pairs, polynomially beating the long-assumed near-linear bound. Princeton's Will Sawin sharpened delta to at least 0.014 the same day. Fields Medalist Tim Gowers and Oxford's Thomas Bloom co-authored the 19-page companion paper, with Gowers saying that if a human had submitted this to the Annals of Mathematics he would have no doubt it was a milestone.
Why it matters: This is the first time a general-purpose AI has autonomously produced a frontier mathematical result that would have cleared peer review on its own. The methodology bridges plane geometry to algebraic number theory via class field towers, which is a new constructive branch, not a heuristic. Daniel Litt called it the unique interesting result produced autonomously by AI so far.
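For readers who want the claim in symbols, here is a rough side-by-side. The conjectured form is the standard textbook statement of the Erdős bound and is an addition here, not taken from the announcement; the right-hand side restates the figures reported above.

```latex
% u(n): maximum number of unit-distance pairs among n points in the plane.
\[
  \underbrace{u(n) = n^{1 + O(1/\log\log n)}}_{\text{conjectured (near-linear)}}
  \qquad\text{vs.}\qquad
  \underbrace{u(n) \ge n^{1+\delta}\ \text{for infinitely many } n,\ \ \delta \ge 0.014}_{\text{reported construction}}
\]
% Still consistent with the classical upper bound u(n) = O(n^{4/3}),
% since 1 + 0.014 < 4/3.
```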
Pichai opened I/O declaring the agentic era and reframing Search as an agent manager, the most consequential structural shift in Google's core product in two decades. Gemini 3.5 Flash is now the default model in the Gemini app and AI Mode in Search worldwide, with AI Mode exceeding 1 billion monthly users. Demis Hassabis unveiled Gemini for Science: AI Co-Scientist plus AlphaEvolve plus Science Skills connecting agentic platforms to 30+ life-science databases, with BASF, Klarna, Daiichi Sankyo, Bayer Crop Science, Stanford Medicine, and U.S. National Labs as partners. Antigravity 2.0 shipped as a desktop app, CLI, and SDK orchestrating parallel autonomous coding agents at 12x the public API speed.
Why it matters: The full-stack play (8th-gen TPUs, Gemini 3.5, Search/Android/Workspace distribution, ad inventory) lets Google monetize the agentic transition in ways pure-play labs can't. The unstated losers are Booking, Expedia, DoorDash, Zillow, and Instacart, marketplaces that get bypassed when agentic Search completes the transaction in the SERP.
The Commerce Department announced letters of intent with nine quantum companies on May 21, totaling $2.013B in CHIPS-and-Science-Act funding, with the federal government taking a minority equity stake in each recipient. IBM gets $1B (plus a $1B match) to launch Anderon, America's first pure-play 300mm quantum wafer foundry in Albany. GlobalFoundries gets $375M. Atom Computing, D-Wave, Infleqtion, PsiQuantum, Quantinuum, and Rigetti each receive up to $100M, with Diraq getting $38M. D-Wave jumped 33%, Rigetti 31%, and IBM 12% the same day.
Why it matters: The government just took equity in nine quantum companies and underwrote two new foundries, modeled on the Intel CHIPS deal. Jefferies reads it as a direct response to China's state-backed quantum push. The multi-modality bet hedges across superconducting, trapped-ion, neutral-atom, silicon-spin, photonic, and topological hardware, which means no winner has been picked yet.
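The headline total can be checked against the individual awards listed above. A quick sketch, assuming each "up to $100M" award lands at its ceiling (final amounts may come in lower):

```python
# Award amounts in millions of USD, as listed above.
# Assumption: each "up to $100M" award is counted at its $100M ceiling.
awards_musd = {
    "IBM": 1000,
    "GlobalFoundries": 375,
    "Atom Computing": 100,
    "D-Wave": 100,
    "Infleqtion": 100,
    "PsiQuantum": 100,
    "Quantinuum": 100,
    "Rigetti": 100,
    "Diraq": 38,
}

total_musd = sum(awards_musd.values())
print(f"{len(awards_musd)} recipients, ${total_musd / 1000:.3f}B total")
# prints "9 recipients, $2.013B total"
```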
Slow Drip
Blog reads worth savoring
Maps the $18B EDA market with hard numbers: token licensing yields ~20% revenue uplift on flat headcount, foundry-mandated tool flows lock in 95%+ retention, and China's share climbs as Synopsys' China revenue slips from 16% to 12%.
Names ten concrete failure modes in Cursor and Claude Code (doom loops from stale observations, plan-mode that still writes, tool outputs eating 70-80% of context) with a four-layer architecture model to debug them.
Stores the full KV cache on CPU and keeps only a Top-k LRU buffer on GPU, cutting per-request GPU memory from 8GB to 200MB at 128K context and 5x-ing batch throughput.
Three fresh infra unicorns in one day: Exa ($250M at $2.2B), Modal ($355M at $4.7B), and TurboPuffer, which hit $100M ARR profitably 19 months after its first $1M while raising under $1M.
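The KV-cache offloading idea in the third read above (full cache on CPU, a small Top-k LRU buffer on GPU) can be pictured with a toy model. This is a minimal sketch, not the post's implementation: the class name, the block granularity, and the relevance scores are all hypothetical, and plain Python dicts stand in for CPU and GPU memory.

```python
from collections import OrderedDict


class OffloadedKVCache:
    """Toy model: every KV block lives in a CPU dict; a small LRU buffer stands in for GPU memory."""

    def __init__(self, gpu_capacity_blocks):
        self.cpu_store = {}               # block_id -> KV payload (the full cache)
        self.gpu_buffer = OrderedDict()   # hot subset, ordered oldest -> newest use
        self.gpu_capacity = gpu_capacity_blocks
        self.transfers = 0                # count of simulated CPU -> GPU copies

    def append(self, block_id, kv_block):
        # New KV blocks are always written to the CPU-side store.
        self.cpu_store[block_id] = kv_block

    def fetch_top_k(self, scores, k):
        """scores: dict block_id -> relevance. Returns payloads of the k best blocks.

        Assumes k <= gpu_capacity so the selected blocks fit in the buffer together.
        """
        top_ids = sorted(scores, key=scores.get, reverse=True)[:k]
        payloads = []
        for bid in top_ids:
            if bid in self.gpu_buffer:
                self.gpu_buffer.move_to_end(bid)          # hit: mark as most recently used
            else:
                self.transfers += 1                       # miss: copy the block in from CPU
                self.gpu_buffer[bid] = self.cpu_store[bid]
                if len(self.gpu_buffer) > self.gpu_capacity:
                    self.gpu_buffer.popitem(last=False)   # evict the least recently used block
            payloads.append(self.cpu_store[bid])
        return payloads


cache = OffloadedKVCache(gpu_capacity_blocks=2)
for i in range(4):
    cache.append(i, f"kv-{i}")

cache.fetch_top_k({0: 0.9, 1: 0.1, 2: 0.8, 3: 0.2}, k=2)  # two misses: blocks 0 and 2
cache.fetch_top_k({0: 0.9, 1: 0.7, 2: 0.1, 3: 0.2}, k=2)  # block 0 hits, block 1 misses
print(cache.transfers, list(cache.gpu_buffer))
```

The point of the design is that GPU memory scales with the buffer size rather than the context length; the cost is a transfer on every miss, so the scheme leans on consecutive steps selecting mostly the same blocks.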
The Grind
Research papers, decoded
Apple researchers stress-tested frontier thinking models (Claude 3.7 Sonnet Thinking, DeepSeek-R1, o3-mini) inside controllable puzzles where they could dial complexity step by step. They found three sharp regimes: on easy problems, vanilla LLMs beat reasoning models for the same compute; on medium problems, reasoning models pull ahead; on hard problems, both collapse to near-zero accuracy. Counterintuitively, the reasoning models cut their thinking tokens as problems get harder, even with budget remaining. Don't pay the thinking tax for low-complexity tasks, and set a hard fallback policy past your domain's complexity ceiling.
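The closing advice above amounts to a three-way routing rule. A minimal sketch, assuming you already have some numeric complexity estimate for a task; the function name, the thresholds, and the strategy labels are hypothetical and would need calibrating per domain:

```python
def choose_strategy(complexity, easy_max=3, ceiling=8):
    """Map a task-complexity estimate onto the three regimes described above.

    easy_max and ceiling are illustrative placeholders, not values from the paper.
    """
    if easy_max > ceiling:
        raise ValueError("easy_max must not exceed ceiling")
    if complexity <= easy_max:
        return "standard_llm"      # easy regime: skip the thinking tax
    if complexity <= ceiling:
        return "reasoning_model"   # medium regime: extended thinking pays off
    return "fallback"              # past the ceiling: escalate to a solver, a human, or decomposition


for level in (1, 5, 12):
    print(level, choose_strategy(level))
```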
Inference-time-only trick for Tiny Recursive Models: inject Gaussian noise at each recursion step to run K parallel trajectories, then pick the winner using the Q-head that TRM already trained but normally throws away. No retraining, no task-specific augmentation. Sudoku-Extreme jumps from 87.3% to 98.75%, Pencil Puzzle Bench from 62.6% to 91.2%, beating an ensemble of seven frontier LLMs at roughly $0.001 per attempt with a 7M-parameter model. Width scaling via parallel noisy rollouts and the internal verifier head is a free accuracy lever before retraining.
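The mechanism above (noisy parallel rollouts, then selection by an internal verifier) fits in a few lines. This is a toy sketch under loud assumptions: `toy_step` and `toy_q_head` are hypothetical stand-ins for TRM's recursion and its trained Q-head, the state is a short list of floats, and the K rollouts run in a plain loop rather than in parallel.

```python
import random


def noisy_best_of_k(init_state, step, q_head, k=8, n_steps=16, sigma=0.1, seed=0):
    """Run k noisy recursive rollouts and keep the one the verifier scores highest."""
    rng = random.Random(seed)
    best_state, best_score = None, float("-inf")
    for _ in range(k):                      # a real system would batch these in parallel
        state = list(init_state)
        for _ in range(n_steps):
            state = step(state)             # one deterministic recursion step
            state = [s + rng.gauss(0.0, sigma) for s in state]  # Gaussian noise injection
        score = q_head(state)               # verifier head scores the finished trajectory
        if score > best_score:
            best_state, best_score = state, score
    return best_state, best_score


# Hypothetical stand-ins: the "model" pulls the state halfway toward a fixed target,
# and the "Q-head" rewards states that end up close to that target.
TARGET = [1.0, -2.0, 0.5]


def toy_step(state):
    return [0.5 * s + 0.5 * t for s, t in zip(state, TARGET)]


def toy_q_head(state):
    return -sum((s - t) ** 2 for s, t in zip(state, TARGET))


_, score_k1 = noisy_best_of_k([0.0, 0.0, 0.0], toy_step, toy_q_head, k=1)
_, score_k8 = noisy_best_of_k([0.0, 0.0, 0.0], toy_step, toy_q_head, k=8)
print(score_k1, score_k8)
```

With the same seed, the first rollout is identical in both calls, so the best-of-8 score can never be lower than the single-rollout score. That is the whole appeal of width scaling: extra samples only help, provided the verifier head ranks trajectories sensibly.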
Converts deterministic Recursive Reasoning Models into probabilistic ones via amortized variational inference, modeling reasoning as a stochastic latent trajectory so the model can pursue multiple hypotheses in parallel and scale inference-time compute through depth and trajectory sampling. 97.0% on Sudoku-Extreme vs TRM's 87.4%, 99.7% on 8x8 N-Queens with 90.3% solution coverage, plus an unconditional-generation mode (99.05% valid Sudoku boards from empty inputs) that deterministic baselines can't do. The trained-from-scratch counterpart to PTRM.
A survey that reframes code in LLM agents from the thing the agent writes to the operational substrate the agent runs on, covering how code mediates reasoning (Program-of-Thoughts, PAL), acting (Code-as-Policies, Voyager), environment modeling (SWE-bench, WorldCoder), and multi-agent coordination (ChatDev, MetaGPT). Organizes the field into three layers (interface, mechanisms, scaling) and names the missing engineering discipline: harness engineering. If you're building agents, the bottleneck is the harness, not the LLM.
Predicts how a candidate LLM will do on downstream tasks without running benchmarks: aggregate token-level statistics (entropy, top-k accuracy, expert-token rank) from the model's next-token distribution over expert-written solutions. Spearman rho 0.81 vs 0.36 for cross-entropy when ranking 18 heterogeneous models; reliable ranking of 25 pretraining corpora at ~10,000x less compute than direct evaluation. If you're picking between checkpoints, base models, or pretraining mixes, score a handful of expert solutions and aggregate token-level signals.
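The three signals named above can be computed from a model's next-token distributions over an expert solution. A minimal sketch, assuming each position gives you a full token-to-probability mapping that includes the expert's token; the function names are hypothetical, and how the paper combines these signals into a final predictor is not described above, so the aggregation here is a plain average.

```python
import math


def token_stats(dist, expert_token, k=5):
    """dist: dict token -> probability at one position (must contain expert_token)."""
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    ranked = sorted(dist, key=dist.get, reverse=True)
    rank = ranked.index(expert_token) + 1   # 1 means the model's top choice
    return {"entropy": entropy, "rank": rank, "top_k_hit": rank <= k}


def aggregate(positions, k=5):
    """positions: list of (dist, expert_token) pairs along one expert solution."""
    stats = [token_stats(dist, tok, k) for dist, tok in positions]
    n = len(stats)
    return {
        "mean_entropy": sum(s["entropy"] for s in stats) / n,
        "top_k_accuracy": sum(s["top_k_hit"] for s in stats) / n,
        "mean_expert_rank": sum(s["rank"] for s in stats) / n,
    }


# Two toy positions: the model agrees with the expert at the first, not at the second.
positions = [
    ({"a": 0.6, "b": 0.4}, "a"),
    ({"a": 0.1, "b": 0.9}, "a"),
]
summary = aggregate(positions, k=1)
print(summary)
```

Scoring a candidate this way needs only forward passes over a handful of expert-written solutions, which is where the compute savings relative to full benchmark runs would come from.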
The Mill
Builder tools ground for action
Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode. Fewer tokens, fewer tool calls, 100% local. The drop-in knowledge layer in front of your coding agent.
Anthropic's official, managed directory of high-quality Claude Code plugins. Brand-new repo, instantly the canonical install source for Claude Code extensions.
Turns any codebase into an interactive knowledge graph you can explore, search, and ask questions about. Works with Claude Code, Codex, Cursor, Copilot, and Gemini CLI.
Learn it. Build it. Ship it for others. A free, from-first-principles AI engineering curriculum that keeps climbing as bootcamp alternatives stay popular.
Run one-person companies entirely with AI agents. Astra (an AI CEO) manages 10+ pre-built agents (CMO, CTO, etc.) and orchestrates Claude Code and Hermes under the hood. Hand it a KPI like 10x traffic this month and it plans, assigns, and reports back.
Self-updating knowledge bases. Pre-built automations that keep docs, changelogs, and translations current automatically whenever the product changes.
Vibe-code apps with the safety net of a no-code editor. Prompt AI to generate an app, then refine screens, workflows, and DB in a visible no-code editor. No more black box.
Product demo videos, recorded by your AI agent. Drives your web app over MCP and returns a polished demo video plus GIF with zooms, cursor motion, and intro animation.
The Counter
Voices from the AI bar today
How Man Group put AI-generated trading signals into live production under a skills framework plus core data layer, with compliance oversight, and scaled it across 750+ developers. A real template for deploying compliant agents in regulated industries.
Benchmarks Gemini 3.5 Flash on the CARE benchmark (~75% planning, 46% intent recovery) and stress-tests Antigravity 2.0's agent-first, no-IDE workflow. Honest take on what works and what's still painful.
Not a demo. Not a benchmark. The future already runs locally on dead hardware from 1999. The same thread surfaces a $1,472 1B-param model matching peers 7x its size and a 66M-param TTS model beating ElevenLabs on a Raspberry Pi.
The June release rumor mill, spotlighting Musk's Macrohard, a purely-AI software company under xAI positioned against Microsoft.
A senior engineer lays out a vibe-coding playbook: plan mode, iterative validation, version control, and forced test generation. The rules that make hands-off AI builds actually ship.
Field-tested Claude tricks across Projects, Custom Styles, and subagents. A tactical cheat sheet for power users, and it hit the front page this morning.
Roast Calendar
Your AI week, day by day
Last Sip
Parting thoughts
A single morning of phone calls killed a year of federal interagency work, a general-purpose reasoning model produced a disproof that cleared a Fields Medalist's smell test, and a wafer foundry is being built in Albany on a government equity check. The throughline is that the people who decide what AI does next now sit in three rooms: a White House where one call moves policy, a Mountain View stage where Search becomes an agent manager, and a Mila lab where 7M-parameter models beat ensembles of frontier LLMs for a tenth of a cent per attempt. Pick which room you're building for, and pick on purpose.