May 26, 2026

Agentic Brew Daily

Your daily shot of what's brewing in AI

Fresh Batch

Distilled trend
  • Google's AI Mode default and DeepSeek's permanent 75% V4-Pro cut land the same week: the first pushes publishers toward an Ahrefs-measured 58% CTR loss, the second makes frontier inference 28-34x cheaper than Anthropic's or OpenAI's.
  • Salesforce's $300M Anthropic spend with zero new engineers collides with 74% of enterprises rolling back agents and Claude Dispatch's 50% task-success rate, exposing the gap between agent budgets and agent reliability.
  • Google's WebMCP and Universal Cart, Anthropic's Stainless acquisition, and OpenAI Codex cancelling SaaS subscriptions from a phone all point to the same wedge: agents are now buying, building, and unsubscribing on the user's behalf, with per-seat SaaS as the explicit target.

Bold Shots

Today's biggest AI stories, no chaser

Anthropic's Claude Mythos Preview, gated behind Project Glasswing, scanned 1,000+ open-source projects and surfaced 23,019 vulnerabilities — 6,202 rated high/critical, 90.6% confirmed valid in a 1,752-finding sample. It also uncovered a CVSS 9.1 certificate-forgery flaw in wolfSSL (CVE-2026-5194). Twelve founding partners got preview access — AWS, Apple, Google, Microsoft, NVIDIA, JPMorgan among them. Cloudflare CSO Grant Bourzikas ran Mythos against 50+ internal repos and publicly argued the AI-written patches "are not safe to ship blind" — some silently broke their own code. Startup Depthfirst claims a task-specialized model matches Mythos at one-tenth the cost.

Why it matters: Vuln discovery used to queue behind a small population of skilled researchers; Mythos breaks the bottleneck and shifts the rate-limit to patching. But Cloudflare reframes the debate — the right posture might be assumed-compromise architecture, not faster patch SLAs. Depthfirst's cheaper task-specialized rival directly undercuts the "bigger frontier model always wins" thesis.

A viral clip of OpenAI Codex autonomously opening a billing page and cancelling an Amazon subscription has been seized on as the cleanest "agents eat SaaS" moment yet. The capability shipped April 16 with Codex's desktop computer-use. A May 22 update extended it to drive Mac apps while the screen is off and locked, with task triggering from a phone. OpenAI's own docs explicitly warn against unattended use for "account, security, privacy, network, payment, or credential-related settings" — the exact workflow being demoed.

Why it matters: Codex weekly actives went from 1.6M in March to 4M+ by mid-2026, with token throughput up ~5x in the same window. Simon Taylor calls it "SaaSpocalypse": if an agent can drive a vendor's billing UI to cancel, it can drive the product UI to replace the seat. iShares' tech-software ETF is starting to read agent capability as a SaaS demand risk. The pushback (HN's benzible, MindStudio) is that Codex's computer-use is narrower than Anthropic's by design and domain expertise still moats a lot of SaaS — but the optics matter regardless.

SpaceX filed its S-1 on May 20 for a Nasdaq listing under SPCX, targeting a $1.75T-$2T valuation and a raise of up to $75B. OpenAI confidentially filed a draft prospectus the same week for a Q4 2026 listing led by Goldman and Morgan Stanley at a valuation of $852B-$2T+. SpaceX's S-1 identifies $26.5T of AI exposure inside a $28.5T TAM, and the company merged with xAI two months ago at a combined ~$1.25T valuation. BofA's Michael Hartnett estimates the two IPOs would push US single-sector market concentration from ~40% to ~48%, past every modern bubble peak, dot-com included.

Why it matters: Index-inclusion rules force passive ETF and 401(k) money into the new mega-caps within weeks of listing — retail can't opt out. The fundamentals strain the narrative: OpenAI generated ~$13.1B revenue in 2025 against a ~$9B net loss and ~$22B cash burn, and projects another ~$14B operating loss in 2026 against a $207B capital gap through 2030. April CPI ran 3.8%, near the 4% line BofA flags as a high-valuation IPO warning level. Anthony Scaramucci is calling SpaceX/OpenAI/Anthropic a "holy trinity" that may mark a market top.

Pope Leo XIV released Magnifica Humanitas on May 25 — a 42,000-word, five-chapter encyclical on safeguarding the human person in the age of AI. He signed it on May 15, the 135th anniversary of Leo XIII's Rerum Novarum. The text calls for AI to be "disarmed" from logics of military and economic domination, says classic just-war theory is outdated in an age of algorithmic warfare, and names hidden labor exploitation behind AI systems as "new forms of slavery." Pontiffs usually delegate encyclical unveilings to cardinals; Leo personally co-presented this one alongside Anthropic co-founder Christopher Olah — the first AI executive ever to help unveil a papal encyclical.

Why it matters: The stagecraft is the story. The text describes AI as "more cultivated than built" — language closer to a research note than curial Latin — and names hyperscalers as concentrating epistemic and political power. The co-presentation lands against Anthropic's ongoing legal fight with the Trump administration over military uses of its models. The Vatican is positioning itself as a moral authority on AI architecture and corporate incentives, not just AI use.

DeepSeek made its 75% V4-Pro discount permanent on May 22, freezing what was supposed to be a promo expiring May 31. List pricing is now $0.435/M input tokens (cache-miss), $0.003625/M input tokens (cache-hit), and $0.87/M output tokens. Against Claude Opus 4.7 and GPT-5.5 PRO at ~$30/M output, V4-Pro lands ~28-34x cheaper. V4-Pro is a 1.6-trillion-parameter model optimized for Huawei Ascend 950 chips rather than Nvidia (Huawei is targeting ~750K 950PR units in 2026).

Why it matters: Counterpoint's Neil Shah argues V4-Pro has effectively closed the performance gap on math and reasoning while leading on openness and inference cost. Marcus Schuler's framing: Western labs "structurally cannot match the price without breaking the revenue models their valuations depend on." The second layer is one no spreadsheet resolves: buyers can't simply route production traffic through DeepSeek, given that the model runs on Huawei silicon while the White House escalates IP-theft accusations. Developers have already moved: the dominant pattern is plugging V4-Pro into Claude Code via OpenRouter and running overnight agentic loops that were previously cost-prohibitive.
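The price multiples in this story come down to simple division. A minimal sketch using the V4-Pro prices quoted above; the $25/M rival price is an assumption added here to show where the low end of the 28-34x range could come from, and `blended_input_price` is a hypothetical helper for illustration, not anything DeepSeek publishes.

```python
# USD per million tokens, as quoted in the story above.
V4_PRO_OUTPUT = 0.87
V4_PRO_INPUT_MISS = 0.435
V4_PRO_INPUT_HIT = 0.003625


def price_ratio(rival_price, v4_price=V4_PRO_OUTPUT):
    """How many times more expensive the rival is per output token."""
    return rival_price / v4_price


def blended_input_price(hit_rate):
    """Average input price per million tokens at a cache-hit rate in [0, 1]."""
    return hit_rate * V4_PRO_INPUT_HIT + (1 - hit_rate) * V4_PRO_INPUT_MISS


print(round(price_ratio(30.0), 1))  # ~$30/M rival, as quoted
print(round(price_ratio(25.0), 1))  # assumed $25/M rival, for the low end
print(round(V4_PRO_INPUT_MISS / V4_PRO_INPUT_HIT))  # cache-miss vs cache-hit
```

The last line shows why agentic loops with heavily repeated context are the cheap case: a cached input token costs a small fraction of an uncached one.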

Slow Drip

Blog reads worth savoring

Analysis · Data Science Collective
The Memory Wall Is Strangling Your LLM: Why GPUs Are Faster Than You Think and Slower Than You Need

Quantifies the gap of 200x or more between H100 theoretical throughput (62K tok/s) and real-world inference (100-300 tok/s) and walks through KV caching, speculative decoding, and diffusion LLMs as fixes. A clean mental model for memory-bound vs compute-bound regimes.
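The memory-bound vs compute-bound distinction can be put in numbers with a back-of-envelope estimate. The sketch below assumes an 8B-parameter model in 16-bit weights and H100 SXM spec-sheet figures (about 989 TFLOPS dense BF16, 3.35 TB/s HBM bandwidth). These inputs are assumptions chosen for illustration, not taken from the article, though they land in the same range as its figures.

```python
def compute_bound_tok_s(flops_per_s, n_params):
    # Roughly 2 FLOPs per parameter per generated token (multiply + add).
    return flops_per_s / (2 * n_params)


def memory_bound_tok_s(bytes_per_s, n_params, bytes_per_param):
    # Batch-1 decoding has to stream every weight from HBM once per token.
    return bytes_per_s / (n_params * bytes_per_param)


PEAK_FLOPS = 989e12  # assumed: H100 SXM dense BF16
HBM_BW = 3.35e12     # assumed: H100 SXM HBM3 bandwidth, bytes/s
N_PARAMS = 8e9       # assumed: 8B-parameter model

ceiling = compute_bound_tok_s(PEAK_FLOPS, N_PARAMS)
actual = memory_bound_tok_s(HBM_BW, N_PARAMS, bytes_per_param=2)

print(f"compute-bound ceiling: {ceiling:,.0f} tok/s")
print(f"memory-bound decode:   {actual:,.0f} tok/s")
print(f"gap:                   {ceiling / actual:.0f}x")
```

Under these assumptions the gap is just peak FLOPS divided by memory bandwidth, which is why batching and speculative decoding (more useful work per byte read) help and a faster ALU alone does not.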

Analysis · Towards AI
Sliding Windows Forget: Why Long-Running LLM Apps Need Memory Policy

Open-source benchmark across 7 context policies — importance-based memory retains 90.7% of critical facts vs 10.8% for sliding windows at the same token budget. Actionable if you're building persistent agents.
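To make the contrast concrete, here is a toy version of the two policies under the same token budget. It is a sketch only: the `Message` type, the importance scores, and both functions are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    tokens: int
    importance: float  # hypothetical score in [0, 1], e.g. from a classifier


def sliding_window(history, budget):
    """Keep the most recent messages that fit in the token budget."""
    kept, used = [], 0
    for msg in reversed(history):
        if used + msg.tokens > budget:
            break
        kept.append(msg)
        used += msg.tokens
    return list(reversed(kept))


def importance_based(history, budget):
    """Keep the highest-importance messages that fit, in original order."""
    ranked = sorted(enumerate(history), key=lambda p: p[1].importance, reverse=True)
    chosen, used = [], 0
    for idx, msg in ranked:
        if used + msg.tokens <= budget:
            chosen.append(idx)
            used += msg.tokens
    return [history[i] for i in sorted(chosen)]


history = [
    Message("User is allergic to penicillin", 8, 0.95),
    Message("Chit-chat about the weather", 8, 0.05),
    Message("More small talk", 8, 0.05),
    Message("What should I take for this infection?", 8, 0.60),
]

print([m.text for m in sliding_window(history, budget=16)])
print([m.text for m in importance_based(history, budget=16)])
```

With a 16-token budget the sliding window drops the allergy fact because it is old, while the importance-based policy keeps it alongside the latest question.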

Analysis · The AI Corner
Claude Dispatch: The AI That Keeps Working When You Don't

Hands-on review of Anthropic's phone-to-desktop delegation with an honest 50/50 success-rate breakdown by task type — file searches reliable, terminal and multi-step tasks fail silently. Read it before you trust it with anything important.

Tutorial · Data Science Collective
A Qwen 3.5 122B LLM on a 16 GB Mac mini: MoE Expert Streaming with TurboQuant-MLX

Reproducible recipe for running a 122B MoE on a $599 Mac mini by streaming only the 8 active experts per token from SSD — 9 GB peak RAM, 54 GB on disk via 3-bit quant. Local inference reframed as a disk-bandwidth problem, not a RAM one.
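The disk and RAM figures can be sanity-checked with a few lines of arithmetic. This sketch treats 1 GB as 10^9 bytes and ignores file-format details, both of which are simplifying assumptions. The gap between the raw 3-bit estimate and the quoted 54 GB is plausibly quantization metadata and layers kept at higher precision, but the blurb does not say.

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Raw weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(size_bytes, n_params):
    """Average bits per parameter implied by an on-disk size."""
    return size_bytes * 8 / n_params


N_PARAMS = 122e9

print(quantized_size_gb(N_PARAMS, 16))             # 16-bit baseline
print(quantized_size_gb(N_PARAMS, 3))              # pure 3-bit weights
print(round(effective_bits(54e9, N_PARAMS), 2))    # implied by 54 GB on disk
```

Either way the full model is several times larger than 16 GB of RAM, which is why the recipe depends on reading only the active experts from SSD for each token.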

News · Towards AI
Two HTML Attributes Now Turn Your Website Into an AI Agent Tool — Inside Chrome's WebMCP

Chrome 149 origin trial lets sites expose forms to AI agents via data-mcp-name / data-mcp-args, replacing brittle vision-based UI actuation with direct function calls. Web devs should track this now.

The Grind

Research papers, decoded

Reasoning & Evaluation · 8,729 upvotes · X / arxiv / alphaxiv
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Apple stress-tests Claude 3.7 Sonnet Thinking, DeepSeek-R1, and o3-mini on four controllable puzzles at the same 64K-token budget. Finds three sharp regimes: at low complexity standard LLMs beat the thinking variants, at medium complexity LRMs win, and beyond a model-specific threshold both collapse to ~0 accuracy. As problems get harder, the models reduce thinking tokens even with budget left, and handing them the explicit Tower of Hanoi algorithm barely helps.
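For reference, the Tower of Hanoi algorithm in question is short. The sketch below is the standard textbook recursion, not the paper's exact prompt, and it shows why disk count works as a complexity dial: the shortest solution has 2^n - 1 moves.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the list of (from_peg, to_peg) moves that solves n disks."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # clear the top n-1 disks
        + [(source, target)]                   # move the largest disk
        + hanoi(n - 1, spare, target, source)  # restack the n-1 disks
    )


for disks in (3, 10, 15):
    print(disks, len(hanoi(disks)))
```

Each extra disk doubles the number of moves a model has to emit without a single error, so a model can hold the procedure in context and still fail on the execution length.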

Latent Reasoning · 198 upvotes · alphaxiv
Generative Recursive Reasoning Models (GRAM)

Turns deterministic single-trajectory recursive reasoning into probabilistic multi-trajectory computation via amortized variational inference, with a hierarchical high-level / low-level state structure and learned perturbation distributions. Hits 97.0% on Sudoku-Extreme (vs 87.4% deterministic), 52.0% on ARC-AGI-1, and works as an unconditional generative model (99.05% valid Sudoku boards). For tasks with multiple valid answers, deterministic recurrent reasoning is leaving accuracy on the table.

Test-Time Search · 67 upvotes · alphaxiv
Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Drop-in replacement for GRPO's scalar advantage — trains the policy to anticipate vector-valued rewards (per-test-case, multi-reward, multi-persona) and emit a set of solutions specialized to different trade-offs. Combines multi-answer chains with stochastic Dirichlet scalarization. The gap with GRPO widens with the search budget — on LiveCodeBench evolutionary search, VPO models solve problems GRPO models can't solve at all.
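Stochastic Dirichlet scalarization, in general terms, means sampling a random weight vector from a Dirichlet distribution and collapsing a vector reward into a scalar with it. The sketch below shows that one step with the standard library only. The function names and the toy per-test-case rewards are hypothetical, and how VPO plugs this into its training loop is not shown here.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw weights from Dirichlet(alpha) by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Weighted sum of a vector-valued reward."""
    return sum(w * r for w, r in zip(weights, reward_vector))


rng = random.Random(0)

# Hypothetical pass/fail results on three test cases for two candidates.
rewards = {
    "solution_a": [1.0, 0.0, 0.0],
    "solution_b": [0.0, 1.0, 1.0],
}

for trial in range(3):
    weights = sample_dirichlet([1.0, 1.0, 1.0], rng)
    scores = {name: scalarize(vec, weights) for name, vec in rewards.items()}
    print([round(w, 2) for w in weights], max(scores, key=scores.get))
```

Because the weights change on every draw, different candidates can come out on top in different draws, which rewards a set of solutions covering different trade-offs instead of a single average-best answer.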

Test-Time Scaling · 48 upvotes · alphaxiv
Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Reframes iterative latent reasoning as a dynamical system converging to task-conditioned attractors. Three training tricks plus Adaptive Computation Time halting let the model unroll the equivalent of 40,000 layers — Sudoku-Extreme goes from 2.6% to 99.8%, Maze-Unique hits 93.0% accuracy with 17.4x less average compute via adaptive halting.

Long-Context · 46 upvotes · alphaxiv
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Splits the single scalar gate in delta-rule linear attention into independent channel-wise erase / write gates, with a chunkwise WY parallel algorithm and custom Triton kernels keeping training efficient. At 1.3B params on 100B FineWeb-Edu tokens it beats Mamba-2, GDN, KDA, and Mamba-3 variants — biggest gains on RULER multi-key retrieval (93.0% at 4K) and near-flat throughput from 2K to 16K context on H100. Code released.

The Mill

Builder tools ground for action

30K stars, +5.6K today

Turns any codebase into an interactive knowledge graph you can explore, search, and Q&A against. Works with Claude Code, Codex, Cursor, Copilot, and Gemini CLI. Devs are tired of paying a token tax to re-explain their repo to every agent.

24K stars, +3.2K today

Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, OpenCode, and Hermes Agent: fewer tokens, fewer tool calls, 100% local. Same thesis as Understand-Anything — agent-native code indexing is the new ctags.

18K stars, +3.2K today

Practical hands-on AI engineering curriculum. 3K stars in a day says the 'year of agent experience but no fundamentals' crowd is finally looking for a structured ramp.

154K stars, +2.8K today

Single CLAUDE.md file distilling Karpathy's observations on LLM coding pitfalls — drop-in for Claude Code. Surging alongside Karpathy's reported move to Anthropic.

15K stars, +1.4K today

Open-source plugins for Claude Cowork, aimed at knowledge workers (not just devs). Anthropic officially leaning into the plugin ecosystem.

457 votes · Product Hunt

Generate and iterate UI screens with AI on a live canvas. Google's design-tool entry — direct shot at Vercel v0 and Figma Make.

Design Tools / User Experience

305 votes · Product Hunt

The missing menu bar app for local LLMs on Mac. Pairs neatly with M5 Max / DGX Spark local-inference chatter — managing local models is the new pain point.

Open Source / Developer Tools

286 votes · Product Hunt

Automate any Mac app with $0 recurring run cost. Local-first Mac automation — same 'stop paying SaaS per agent' thread as ModelHub.

Artificial Intelligence / GitHub

177 votes · Product Hunt

Claude Code that never stops. Automatic model failover for Claude Code sessions — market response to GPU rentals up 200%.

Productivity / Software Engineering

155 votes · Product Hunt

Generate edited, sound-designed videos via chat. Runway moves from tool to agent.

Design Tools / Social Media

The Counter

Voices from the AI bar today


Argues AI procurement is becoming a supply-chain problem, not a software problem — HBM, advanced packaging, and grid power are the binding constraints. Aimed at people planning 2026 capacity.

AI News & Strategy Daily | Nate B Jones

Five-pillar framework: agent harnesses, software factories, extensible software, always-on agents, agentic access. Useful if you're deciding whether to specialize in orchestration vs keep writing code.

IndyDevDan

How AI infra is repricing Taiwan/Korea equities via TSMC, Samsung, and SK Hynix concentration. Complements the Nate B Jones supply-chain thesis.

CNBC International

The line that anchored the Codex-vs-Claude rivalry topic this cycle — 9.8K likes, 556 RT, 703K views.

@naval

Goldman's number is what every infra slide will cite for the next quarter.

@BankXRP

Top-voted thread of the cycle. Anthropic moves from product to education distribution — direct response to 'where do I learn agentic engineering' demand.

r/ClaudeAI

Highest upvote count in the pool. Discussion centers on whether token spend actually replaces headcount or just shifts the cost line.

r/ArtificialInteligence

Roast Calendar

Your AI week, day by day

Last Sip

Parting thoughts

The pattern of the day is pretty clear once you line up Salesforce's $300M token bill, Claude Dispatch's 50/50 success rate, and the SF event titled "What actually breaks first when they run 24/7." The bets are getting placed faster than the ops layer can hold them. If you're shipping an agent this week, the most useful read here might just be Samarth Vinayaka's memory-policy benchmark — pick a policy before your agent picks one for you.