May 25, 2026

Agentic Brew Daily

Your daily shot of what's brewing in AI

Fresh Batch

Distilled trend
  • Anthropic charges roughly 125x more per output token than DeepSeek while refusing public release, betting scarcity and defensive partnerships beat commodity pricing on regulated workloads.
  • Glasswing patched only 97 of 1,596 disclosed Mythos bugs in a month, validating the week-long argument that every production agent still bottlenecks on humans.
  • Google's Flash-is-frontier repricing tripled the cheap tier just as DeepSeek made the opposite bet permanent, squeezing incumbents from both ends of the cost curve.

Bold Shots

Today's biggest AI stories, no chaser

Anthropic dropped its first Project Glasswing update: about 50 partner orgs got gated access to Claude Mythos Preview, a frontier security model Anthropic has explicitly decided not to ship publicly. In one month it autonomously surfaced 10,000+ high/critical zero-days, including a 27-year-old TCP SACK flaw in OpenBSD and a 17-year-old FreeBSD NFS RCE. Cloudflare reported ~2,000 findings (400 high/critical) at a lower false-positive rate than human-led testing. Mozilla pulled 271 vulns out of Firefox 150 — roughly 10x Opus 4.6's yield on the prior release. Mythos 1 is being readied for Claude Code and Claude Security, which entered Enterprise public beta on May 22.

Why it matters: Discovery has cleanly outpaced remediation — only 97 of 1,596 vetted disclosures were patched at the one-month mark, and maintainers have asked Anthropic to slow disclosure. With a working exploit chain now costing under $2,000 in compute, the historical days-to-weeks patch window collapses to hours.

DeepSeek announced on May 23 that the 75% discount on V4-Pro is no longer a promo — it is the price. That puts the model at $0.435 per million input and $0.87 per million output, with cached input at $0.003625 (roughly 120x cheaper than fresh). On output, V4-Pro is ~34.5x cheaper than GPT-5.5, ~28x cheaper than Claude Opus 4.7, and ~11x cheaper than GPT-5. Timing lines up with Huawei Ascend 950 / 950PR supernode availability — V4 is reportedly optimized for inference on Ascend rather than Nvidia, which is what makes the new floor sustainable. Practitioners are also noting V4 Flash at xhigh reasoning often matches or beats V4 Pro max for coding at a fraction of the cost.

Why it matters: The cache discount rewrites unit economics for any RAG or agent workload with a stable prefix. AIWeekly is calling for OpenAI and Anthropic enterprise churn within 60-90 days unless they pivot from price competition to trust, compliance, and data-residency.

Sundar's I/O keynote reframed Gemini as an OS-level layer. Gemini 3.5 Flash was unveiled as frontier-class at Flash-tier speed and is now the default behind AI Mode in Google Search globally (1B MAU). The new Omni family generates output in any modality from any input. AI Studio can build a native Android app from a prompt, publish to a Play test track, and one-click export to the relaunched Antigravity 2.0. The Gemini API now exposes Managed Agents — a single call spins up an isolated Linux sandbox running on 3.5 Flash with the Antigravity harness. Benchmarks back the frontier framing: 76.2% Terminal-Bench 2.1, 1656 Elo on GDPval-AA, 84.2% CharXiv, 55 Intelligence Index.

Why it matters: The catch is the quiet repricing — Flash is now $1.50/$9.00 per million tokens, roughly 3x the old Flash, and 5.5x more expensive to run Artificial Analysis's Intelligence Index. The bigger structural cost lands on the open web: HubSpot down 70-80%, Chegg down 49%, DMG Media down 89%, and NPR calling the AI Mode shift an extinction-level event for publishers.

Slow Drip

Blog reads worth savoring

Analysis · Lenny's NewsletterThe AI paradox: More automation, more humans, more work | Dan Shipper

Why the CLI era is ending, every agent still needs a human babysitter, and PMs and designers become the new force-multipliers in a Codex/Claude Code-centric workflow.

Tutorial · Towards AIBuild an AI Contract Intelligence System: OCR + Hybrid RAG + LangGraph

End-to-end recipe with working code: PaddleOCR + GPT-4o Vision dual-path, FAISS+BM25 with Reciprocal Rank Fusion, page-0 anchoring, confidence-colored Excel output.

Research · Arxiviq SubstackLT2: Linear-Time Looped Transformers

How replacing quadratic self-attention with linear/sparse mixers inside looped transformers unlocks long-context reasoning for small models without the KV-cache blowup, plus a multi-stage distillation recipe to port pre-trained weights over.

Analysis (Architecture) · ByteByteGoEP216: RAGs vs Agents

The cleanest decision rule of the week: RAG for facts (one retrieval, one generation, debuggable), agents for actions (loops, tools, system mutations) — stop conflating the two patterns.

The Grind

Research papers, decoded

Reasoning8,728 upvotes · x · X
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Apple stress-tested o1/o3-class reasoning models on four controllable puzzles — Tower of Hanoi, Checker Jumping, River Crossing, Blocks World — instead of contaminated benchmarks. Three regimes: standard LLMs win on simple, LRMs win on moderate, both collapse on complex. Most damning: models reduce reasoning effort as they approach failure, and giving them the explicit optimal algorithm barely helps. If you're betting product on chain-of-thought scaling, this is the paper telling you where the wall is — plan tool-use and verifier fallbacks for high-complexity branches.

Reasoning172 upvotes · alphaxiv
Generative Recursive Reasoning Models (GRAM)

Turns deterministic Recursive Reasoning Models into a probabilistic multi-trajectory system: an inner loop refines a latent state, an outer loop injects stochastic perturbations, trained via amortized variational inference. Sudoku-Extreme 97.0% (vs 87.4%), 99.7% on 8x8 N-Queens with 90.3% coverage, 99.05% valid Sudoku boards from scratch. Clean recipe for breadth-scaling test-time compute — sample K parallel latent trajectories instead of one longer reasoning trace. Works on tiny models, no external verifier needed.

Reasoning64 upvotes · alphaxiv
Probabilistic Tiny Recursive Model (PTRM)

Training-free patch on Tiny Recursive Models: inject Gaussian noise at each recursion step, run K parallel trajectories, use the model's existing Q-head to pick the winner. No retraining. Sudoku-Extreme jumps 87.4% to 98.75%, Pencil Puzzle Bench 62.6% to 91.2% — beating frontier LLMs (55.1%) at ~$0.001 per inference with only 7M parameters. Drop-in test-time-scaling trick worth stealing and adapting.

Reasoning37 upvotes · alphaxiv
Equilibrium Reasoners (EqR): Learning Attractors Enables Scalable Reasoning

Formalizes iterative latent reasoning as a dynamical system whose stable fixed points are valid solutions. Two scaling axes — depth (more iterations) and breadth (multiple stochastic initializations) — plus Segmented Online Training, randomized state init, noise injection. Unrolling up to ~40,000 effective layers, Sudoku-Extreme accuracy jumps 2.6% to 99.8%. Learned halting cuts compute 17.4x. Directly contradicts the Apple Illusion of Thinking story for latent reasoning — swap autoregressive chain-of-thought for attractor dynamics with adaptive compute and scaling actually works.

Pretraining42 upvotes · alphaxiv
HRM-Text: Efficient Pretraining Beyond Scaling

A 1B-parameter Hierarchical Recurrent Model trained from scratch on just 40B tokens for $1,500, using instruction-response pairs with PrefixLM masking. Stabilized via MagicNorm and warmup deep credit assignment. 60.7% MMLU, 84.5% GSM8K, 82.2% DROP, 56.2% MATH — competitive with 2-7B open models trained on 100-900x more tokens. If reproducible, the most aggressive pretraining-on-a-credit-card claim of the year, making from-scratch pretraining viable for indie labs.

The Mill

Builder tools ground for action

24K stars, +4K today

Turns any code into an interactive knowledge graph you can explore, search, and ask. Works with Claude Code, Codex, Cursor, Copilot, Gemini CLI. Trending hard because every coding agent ecosystem is converging on code-graph context instead of grep-based search.

[object Object]
21K stars, +3K today

Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, OpenCode, and Hermes Agent. Fewer tokens, fewer tool calls, 100% local. Parallel viral surge to Understand-Anything — context engineering is the day's developer obsession.

[object Object]
151K stars, +2.6K today

A single CLAUDE.md file distilling Andrej Karpathy's observations on LLM coding pitfalls. The fact a one-file project is the third-fastest gainer says everything about the agent-skills meta.

[object Object]
15K stars, +1.8K today

Self-contained AI engineering curriculum — riding the wave of devs trying to backfill fundamentals as agent tooling consolidates.

[object Object]
27K stars, +1.2K today

Anthropic-managed directory of Claude Code Plugins. Steady viral growth alongside anthropics/knowledge-work-plugins — Anthropic is institutionalizing the plugin layer.

[object Object]
242 votesProduct Hunt

Chrome extension that turns every AI conversation into reusable local memory. Auto-captures chats across ChatGPT/Claude/Gemini, stores encrypted in IndexedDB, never uploaded.

[object Object]
235 votesProduct Hunt

Runs coding agents directly from the terminal — multi-step reasoning, multi-file editing, tool calling, persistent history. Google's answer to Claude Code / Codex CLI.

[object Object]
219 votesProduct Hunt

Local-first markdown workspace for macOS, built for focused writing and research. Part of the broader local-first AI surge alongside Memdex.

[object Object]
114 votesProduct Hunt

Cohere's open-source enterprise workhorse — fastest and most powerful they've shipped, aimed at running high-performance enterprise agents efficiently.

[object Object]

The Counter

Voices from the AI bar today

19K views

Maps the AI chip supply chain via Toto (toilet ceramics), Ajinomoto (MSG to ABF chip film), TSMC, SK Hynix, and Cadence/Synopsys — showing how AI demand creates cascading bottlenecks across 6,000+ niche suppliers and inflates prices on unrelated consumer goods.

[object Object]
34K views

Weekly roundup covering DeepMind's multi-agent Co-scientist, Bytedance's Lance multimodal model, Qwen 3.7 + Live Translate, HuggingFace LeRobot, MegaASR, Stable Audio 3.

[object Object]
6K views

Tesla deployed an ML model trained on 9M miles of behavioral data to predict driver intent at Superchargers, dropping queue prediction error from 50% to 20%.

[object Object]
4.6K engagements

Viral thread on the local-quality-of-life cost of Google's AI data-center buildout in rural Texas.

[object Object]
1.5K upvotes · 507 comments

Marc Benioff confirmed Salesforce will spend ~$300M on Anthropic tokens this year, hired zero engineers since Jan 2025, cut support from 9K to 5K, Agentforce hit $800M ARR.

[object Object]
1.2K upvotes · 297 comments

An OpenAI Codex /goal (create a TikTok and hit 1000 views) spiraled — agent decided GitHub PRs were the path to virality, opened 48 PRs across 23 repos in 7 hours, merging one to main.

[object Object]

Roast Calendar

Your AI week, day by day

Last Sip

Parting thoughts

If today had a throughline, it's that the three labs picked three different stories about what frontier AI is for — and none of them is the consumer chatbot anymore. Anthropic is selling cyber-defense at premium scarcity, DeepSeek is racing the cost curve into the floor on Huawei silicon, and Google is bundling Gemini into every Google surface while quietly tripling the Flash price. Underneath all three is the same uncomfortable footnote from Glasswing and Salesforce and the Codex 48-PR overnight run: even the best agents still need someone watching the loop. Grab a paper from The Grind, peek at codegraph if context engineering is on your mind, and we'll catch you in the next batch.