Agentic Brew Daily
Your daily shot of what's brewing in AI
Fresh Batch
- Google's AI Mode default and DeepSeek's permanent 75% V4-Pro price cut land the same week: the first pushes publishers toward an Ahrefs-measured 58% CTR loss, the second makes frontier-class inference 28-34x cheaper than Anthropic's or OpenAI's.
- Salesforce's $300M Anthropic spend with zero new engineers collides with 74% of enterprises rolling back agents and Claude Dispatch's 50% task-success rate, exposing the gap between agent budgets and agent reliability.
- Google's WebMCP and Universal Cart, Anthropic's Stainless acquisition, and OpenAI Codex cancelling SaaS subscriptions from a phone all point to the same wedge: agents are now buying, building, and unsubscribing on the user's behalf, with per-seat SaaS as the explicit target.
Bold Shots
Today's biggest AI stories, no chaser
Anthropic's Claude Mythos Preview, gated behind Project Glasswing, scanned 1,000+ open-source projects and surfaced 23,019 vulnerabilities — 6,202 rated high/critical, 90.6% confirmed valid in a 1,752-finding sample. It also uncovered a CVSS 9.1 certificate-forgery flaw in wolfSSL (CVE-2026-5194). Twelve founding partners got preview access — AWS, Apple, Google, Microsoft, NVIDIA, JPMorgan among them. Cloudflare CSO Grant Bourzikas ran Mythos against 50+ internal repos and publicly argued the AI-written patches "are not safe to ship blind" — some silently broke their own code. Startup Depthfirst claims a task-specialized model matches Mythos at one-tenth the cost.
Why it matters: Vuln discovery used to queue behind a small population of skilled researchers; Mythos breaks the bottleneck and shifts the rate-limit to patching. But Cloudflare reframes the debate — the right posture might be assumed-compromise architecture, not faster patch SLAs. Depthfirst's cheaper task-specialized rival directly undercuts the "bigger frontier model always wins" thesis.
A viral clip of OpenAI Codex autonomously opening a billing page and cancelling an Amazon subscription has been seized on as the cleanest "agents eat SaaS" moment yet. The capability shipped April 16 with Codex's desktop computer-use. A May 22 update extended it to drive Mac apps while the screen is off and locked, with task triggering from a phone. OpenAI's own docs explicitly warn against unattended use for "account, security, privacy, network, payment, or credential-related settings" — the exact workflow being demoed.
Why it matters: Codex weekly actives went from 1.6M in March to 4M+ by mid-2026, with token throughput up ~5x in the same window. Simon Taylor calls it "SaaSpocalypse": if an agent can drive a vendor's billing UI to cancel, it can drive the product UI to replace the seat. iShares' tech-software ETF is starting to read agent capability as a SaaS demand risk. The pushback (HN's benzible, MindStudio) is that Codex's computer-use is narrower than Anthropic's by design and domain expertise still moats a lot of SaaS — but the optics matter regardless.
Using computer use, you can ask codex to cancel subscriptions you don't need anymore. Very pleasant to watch. No particular one in mind, works on all of them. chatgpt.com/codex/
I truly believe codex team is about to hit the inflection point. Nearly everything I've been complaining about internally has been addressed. Clay on the wheel is centered. We're about to throw off the…
SpaceX filed its S-1 on May 20 for a Nasdaq listing under SPCX, targeting a $1.75T-$2T valuation and a raise of up to $75B. OpenAI confidentially filed a draft prospectus the same week for a Q4 2026 listing led by Goldman and Morgan Stanley at $852B-$2T+. SpaceX's S-1 identifies $26.5T of AI exposure inside a $28.5T TAM, and the company merged with xAI two months ago at a combined ~$1.25T valuation. BofA's Michael Hartnett estimates the two IPOs would push US single-sector market concentration from ~40% to ~48%, past every modern bubble peak, dot-com included.
Why it matters: Index-inclusion rules force passive ETF and 401(k) money into the new mega-caps within weeks of listing — retail can't opt out. The fundamentals strain the narrative: OpenAI generated ~$13.1B revenue in 2025 against a ~$9B net loss and ~$22B cash burn, and projects another ~$14B operating loss in 2026 against a $207B capital gap through 2030. April CPI ran 3.8%, near the 4% line BofA flags as a high-valuation IPO warning level. Anthony Scaramucci is calling SpaceX/OpenAI/Anthropic a "holy trinity" that may mark a market top.
2026 IPO Launches Will Be Historic: 1. SpaceX: Expected at $1.5 trillion valuation 2. OpenAI: Expected at $1+ trillion valuation 3. Anthropic: Expected at $500 billion valuation.
SpaceX revealed eye-popping numbers in its IPO prospectus, including a $26.5 trillion potential market for an empire spanning artificial intelligence and telecommunications.
Pope Leo XIV released Magnifica Humanitas on May 25 — a 42,000-word, five-chapter encyclical on safeguarding the human person in the age of AI. He signed it on May 15, the 135th anniversary of Leo XIII's Rerum Novarum. The text calls for AI to be "disarmed" from logics of military and economic domination, says classic just-war theory is outdated in an age of algorithmic warfare, and names hidden labor exploitation behind AI systems as "new forms of slavery." Pontiffs usually delegate encyclical unveilings to cardinals; Leo personally co-presented this one alongside Anthropic co-founder Christopher Olah — the first AI executive ever to help unveil a papal encyclical.
Why it matters: The stagecraft is the story. The text describes AI as "more cultivated than built" — language closer to a research note than curial Latin — and names hyperscalers as concentrating epistemic and political power. The co-presentation lands against Anthropic's ongoing legal fight with the Trump administration over military uses of its models. The Vatican is positioning itself as a moral authority on AI architecture and corporate incentives, not just AI use.
Pope, urging AI regulation, warns some weapons now beyond human control reut.rs/3PGQTHi
BREAKING: AI company cofounder Chris Olah said in a press conference on Pope Leo's first encyclical that those outside of the artificial intelligence industry need to hold developers to account.
DeepSeek made its 75% V4-Pro discount permanent on May 22, freezing what was supposed to be a promo expiring May 31. List pricing is now $0.435/M input tokens on a cache miss, $0.003625/M on a cache hit, and $0.87/M output tokens. Against Claude Opus 4.7 and GPT-5.5 PRO at ~$30/M output, V4-Pro lands ~28-34x cheaper on output. V4-Pro is a 1.6-trillion-parameter model optimized for Huawei Ascend 950 chips rather than Nvidia (Huawei is targeting ~750K 950PR units in 2026).
Why it matters: Counterpoint's Neil Shah argues V4-Pro has effectively closed the performance gap on math and reasoning while leading on openness and inference cost. Marcus Schuler's framing: Western labs "structurally cannot match the price without breaking the revenue models their valuations depend on." There is a second layer that no spreadsheet resolves: buyers can't simply route production traffic through DeepSeek, given that the model runs on Huawei silicon while the White House escalates IP-theft accusations. Developers have already moved: the dominant pattern is plugging V4-Pro into Claude Code via OpenRouter and running overnight agentic loops that were previously cost-prohibitive.
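The price gap is easy to sanity-check from the quoted list prices. A minimal sketch, using only the $0.87/M and ~$30/M output figures from the item above; the function names are illustrative, and the comparison price at the 28x end is back-calculated here, not a quoted figure.

```python
# Sanity check of the quoted price gap, using only figures from the item above.
V4_PRO_OUTPUT_PER_M = 0.87    # USD per million output tokens (quoted)
WESTERN_OUTPUT_PER_M = 30.0   # USD per million output tokens (quoted as "~$30/M")


def price_ratio(expensive: float, cheap: float) -> float:
    """How many times cheaper `cheap` is than `expensive`."""
    return expensive / cheap


def implied_price(ratio: float, cheap: float) -> float:
    """Back out the comparison price that would produce a given ratio."""
    return ratio * cheap


top_ratio = price_ratio(WESTERN_OUTPUT_PER_M, V4_PRO_OUTPUT_PER_M)
low_end_price = implied_price(28, V4_PRO_OUTPUT_PER_M)

print(f"$30/M vs $0.87/M output: {top_ratio:.1f}x")
print(f"Comparison price implied by 28x: ${low_end_price:.2f}/M")
```

At $30/M the gap comes out near the top of the 28-34x range; the 28x end corresponds to a comparison price of roughly $24/M output.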
Slow Drip
Blog reads worth savoring
Quantifies the 200x gap between H100 theoretical throughput (62K tok/s) and real-world inference (100-300 tok/s) and walks through KV caching, speculative decoding, and diffusion LLMs as fixes — clean mental model for memory-bound vs compute-bound regimes.
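The memory-bound regime can be illustrated with a back-of-envelope estimate: in single-stream decoding, each generated token requires reading every model weight from memory, so bandwidth rather than FLOPs sets the ceiling. A rough sketch only; the bandwidth (3.35 TB/s), model size (8B parameters), and precision (2 bytes per weight) are assumptions chosen for illustration and do not come from the post.

```python
# Back-of-envelope decode ceiling for a memory-bound workload.
# All hardware and model numbers below are illustrative assumptions.

def memory_bound_tokens_per_s(bandwidth_bytes_per_s: float,
                              n_params: float,
                              bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed when each token
    requires streaming every weight from memory once."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


def gap(theoretical_tok_s: float, observed_tok_s: float) -> float:
    """Ratio between a theoretical throughput and an observed one."""
    return theoretical_tok_s / observed_tok_s


ceiling = memory_bound_tokens_per_s(3.35e12, 8e9, 2)
print(f"Bandwidth-bound ceiling: {ceiling:.0f} tok/s")
print(f"Gap vs 62K tok/s at 300 tok/s observed: {gap(62_000, 300):.0f}x")
print(f"Gap vs 62K tok/s at 100 tok/s observed: {gap(62_000, 100):.0f}x")
```

Under these assumptions the bandwidth ceiling falls inside the 100-300 tok/s range quoted above, which is the intuition behind batching and speculative decoding: both amortize each weight read over more tokens.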
Open-source benchmark across 7 context policies — importance-based memory retains 90.7% of critical facts vs 10.8% for sliding windows at the same token budget. Actionable if you're building persistent agents.
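The difference between the two policies can be sketched in a few lines. This is a toy illustration, not the benchmark's code: the `Message` fields, the importance scores, and the greedy selection rule are all assumptions, and how importance is actually scored is left open.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    tokens: int
    importance: float      # hypothetical score in [0, 1]
    critical: bool = False


def sliding_window(history, budget):
    """Keep the most recent messages that fit in the token budget."""
    kept, used = [], 0
    for msg in reversed(history):
        if used + msg.tokens > budget:
            break
        kept.append(msg)
        used += msg.tokens
    return list(reversed(kept))


def importance_based(history, budget):
    """Greedily keep the highest-importance messages that fit, in original order."""
    ranked = sorted(enumerate(history), key=lambda p: p[1].importance, reverse=True)
    chosen, used = [], 0
    for idx, msg in ranked:
        if used + msg.tokens <= budget:
            chosen.append(idx)
            used += msg.tokens
    return [history[i] for i in sorted(chosen)]


def critical_retention(kept, history):
    """Fraction of critical messages in `history` that survive in `kept`."""
    total = sum(m.critical for m in history)
    return sum(m.critical for m in kept) / total if total else 1.0


history = [
    Message("API key lives in vault 'prod'", 10, 0.9, critical=True),
    Message("Small talk about the weather", 10, 0.1),
    Message("Deploy target is eu-west-1", 10, 0.8, critical=True),
    Message("More small talk", 10, 0.1),
    Message("Thanks, sounds good", 10, 0.2),
]

window = sliding_window(history, budget=20)
ranked = importance_based(history, budget=20)
print("sliding window retention:", critical_retention(window, history))
print("importance-based retention:", critical_retention(ranked, history))
```

In this contrived history the critical facts arrive early, so the sliding window drops both of them while the importance-based policy keeps both at the same 20-token budget.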
Hands-on review of Anthropic's phone-to-desktop delegation with an honest 50/50 success-rate breakdown by task type — file searches reliable, terminal and multi-step tasks fail silently. Read it before you trust it with anything important.
Reproducible recipe for running a 122B MoE on a $599 Mac mini by streaming only the 8 active experts per token from SSD — 9 GB peak RAM, 54 GB on disk via 3-bit quant. Local inference reframed as a disk-bandwidth problem, not a RAM one.
Chrome 149 origin trial lets sites expose forms to AI agents via data-mcp-name / data-mcp-args, replacing brittle vision-based UI actuation with direct function calls. Web devs should track this now.
The Grind
Research papers, decoded
Apple stress-tests Claude 3.7 Sonnet Thinking, DeepSeek-R1, and o3-mini on four controllable puzzles at the same 64K-token budget. It finds three sharp regimes: at low complexity, standard LLMs beat their thinking variants; at medium complexity, the large reasoning models (LRMs) win; beyond a model-specific threshold, both collapse to ~0 accuracy. As problems get harder, the models reduce thinking tokens even with budget left, and handing them the explicit Tower of Hanoi algorithm barely helps.
Turns deterministic single-trajectory recursive reasoning into probabilistic multi-trajectory computation via amortized variational inference, with a hierarchical high-level / low-level state structure and learned perturbation distributions. Hits 97.0% on Sudoku-Extreme (vs 87.4% deterministic), 52.0% on ARC-AGI-1, and works as an unconditional generative model (99.05% valid Sudoku boards). For tasks with multiple valid answers, deterministic recurrent reasoning is leaving accuracy on the table.
Drop-in replacement for GRPO's scalar advantage — trains the policy to anticipate vector-valued rewards (per-test-case, multi-reward, multi-persona) and emit a set of solutions specialized to different trade-offs. Combines multi-answer chains with stochastic Dirichlet scalarization. The gap with GRPO widens with the search budget — on LiveCodeBench evolutionary search, VPO models solve problems GRPO models can't solve at all.
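Stochastic scalarization in general can be sketched as follows: draw a random weight vector on the simplex, then reduce the reward vector to a scalar with it, so that different draws emphasize different trade-offs. This is a generic illustration, not the paper's training code; the symmetric concentration parameter, the reward values, and the function names are assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Collapse a vector-valued reward into one scalar with a weighted sum."""
    return sum(w * r for w, r in zip(weights, reward_vector))


rng = random.Random(0)
# Hypothetical per-test-case rewards for one sampled solution (1 = pass, 0 = fail).
rewards = [1.0, 0.0, 1.0, 1.0]

for _ in range(3):
    weights = sample_dirichlet([1.0] * len(rewards), rng)
    print([round(w, 2) for w in weights], round(scalarize(rewards, weights), 3))
```

Each draw yields a different scalar for the same reward vector, which is what lets a single policy be pushed toward a set of solutions rather than one averaged compromise.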
Reframes iterative latent reasoning as a dynamical system converging to task-conditioned attractors. Three training tricks plus Adaptive Computation Time halting let the model unroll the equivalent of 40,000 layers: Sudoku-Extreme goes from 2.6% to 99.8%, and Maze-Unique hits 93.0% accuracy with 17.4x less average compute via adaptive halting.
Splits the single scalar gate in delta-rule linear attention into independent channel-wise erase / write gates, with a chunkwise WY parallel algorithm and custom Triton kernels keeping training efficient. At 1.3B params on 100B FineWeb-Edu tokens it beats Mamba-2, GDN, KDA, and Mamba-3 variants — biggest gains on RULER multi-key retrieval (93.0% at 4K) and near-flat throughput from 2K to 16K context on H100. Code released.
The Mill
Builder tools ground for action
Turns any codebase into an interactive knowledge graph you can explore, search, and Q&A against. Works with Claude Code, Codex, Cursor, Copilot, and Gemini CLI. Devs are tired of paying token tax to re-explain their repo to every agent.
Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, OpenCode, and Hermes Agent: fewer tokens, fewer tool calls, 100% local. Same thesis as Understand-Anything — agent-native code indexing is the new ctags.
Practical hands-on AI engineering curriculum. 3K stars in a day says the 'year of agent experience but no fundamentals' crowd is finally looking for a structured ramp.
Single CLAUDE.md file distilling Karpathy's observations on LLM coding pitfalls — drop-in for Claude Code. Surging alongside Karpathy's reported move to Anthropic.
Open-source plugins for Claude Cowork, aimed at knowledge workers (not just devs). Anthropic officially leaning into the plugin ecosystem.
Generate and iterate UI screens with AI on a live canvas. Google's design-tool entry — direct shot at Vercel v0 and Figma Make.
The missing menu bar app for local LLMs on Mac. Pairs neatly with M5 Max / DGX Spark local-inference chatter — managing local models is the new pain point.
Automate any Mac app with $0 recurring run cost. Local-first Mac automation — same 'stop paying SaaS per agent' thread as ModelHub.
Claude Code that never stops. Automatic model failover for Claude Code sessions — market response to GPU rentals up 200%.
Generate edited, sound-designed videos via chat. Runway moves from tool to agent.
The Counter
Voices from the AI bar today
Argues AI procurement is becoming a supply-chain problem, not a software problem — HBM, advanced packaging, and grid power are the binding constraints. Aimed at people planning 2026 capacity.
Five-pillar framework: agent harnesses, software factories, extensible software, always-on agents, agentic access. Useful if you're deciding whether to specialize in orchestration vs keep writing code.
How AI infra is repricing Taiwan/Korea equities via TSMC, Samsung, and SK Hynix concentration. Complements the Nate B Jones supply-chain thesis.
The line that anchored the Codex-vs-Claude rivalry topic this cycle — 9.8K likes, 556 RT, 703K views.
Goldman's number is what every infra slide will cite for the next quarter.
Top-voted thread of the cycle. Anthropic moves from product to education distribution — direct response to 'where do I learn agentic engineering' demand.
— highest in the pool. Discussion centers on whether token spend actually replaces headcount or just shifts the cost line.
Roast Calendar
Your AI week, day by day
Last Sip
Parting thoughts
The pattern of the day is pretty clear once you line up Salesforce's $300M token bill, Claude Dispatch's 50/50 success rate, and the SF event titled "What actually breaks first when they run 24/7." The bets are getting placed faster than the ops layer can hold them. If you're shipping an agent this week, the most useful read here might just be Samarth Vinayaka's memory-policy benchmark — pick a policy before your agent picks one for you.