Agentic Brew Daily
Your daily shot of what's brewing in AI
Fresh Batch
- Anthropic charges roughly 125x more per output token than DeepSeek while refusing public release, betting scarcity and defensive partnerships beat commodity pricing on regulated workloads.
- Glasswing patched only 97 of 1,596 disclosed Mythos bugs in a month, validating the week-long argument that every production agent still bottlenecks on humans.
- Google's Flash-is-frontier repricing tripled the cheap tier just as DeepSeek made the opposite bet permanent, squeezing incumbents from both ends of the cost curve.
Bold Shots
Today's biggest AI stories, no chaser
Anthropic dropped its first Project Glasswing update: about 50 partner orgs got gated access to Claude Mythos Preview, a frontier security model Anthropic has explicitly decided not to ship publicly. In one month it autonomously surfaced 10,000+ high/critical zero-days, including a 27-year-old TCP SACK flaw in OpenBSD and a 17-year-old FreeBSD NFS RCE. Cloudflare reported ~2,000 findings (400 high/critical) at a lower false-positive rate than human-led testing. Mozilla pulled 271 vulns out of Firefox 150 — roughly 10x Opus 4.6's yield on the prior release. Mythos 1 is being readied for Claude Code and Claude Security, which entered Enterprise public beta on May 22.
Why it matters: Discovery has cleanly outpaced remediation — only 97 of 1,596 vetted disclosures were patched at the one-month mark, and maintainers have asked Anthropic to slow disclosure. With a working exploit chain now costing under $2,000 in compute, the historical days-to-weeks patch window collapses to hours.
DeepSeek announced on May 23 that the 75% discount on V4-Pro is no longer a promo — it is the price. That puts the model at $0.435 per million input and $0.87 per million output, with cached input at $0.003625 (roughly 120x cheaper than fresh). On output, V4-Pro is ~34.5x cheaper than GPT-5.5, ~28x cheaper than Claude Opus 4.7, and ~11x cheaper than GPT-5. Timing lines up with Huawei Ascend 950 / 950PR supernode availability — V4 is reportedly optimized for inference on Ascend rather than Nvidia, which is what makes the new floor sustainable. Practitioners are also noting V4 Flash at xhigh reasoning often matches or beats V4 Pro max for coding at a fraction of the cost.
Why it matters: The cache discount rewrites unit economics for any RAG or agent workload with a stable prefix. AIWeekly is calling for OpenAI and Anthropic enterprise churn within 60-90 days unless they pivot from price competition to trust, compliance, and data-residency.
Sundar's I/O keynote reframed Gemini as an OS-level layer. Gemini 3.5 Flash was unveiled as frontier-class at Flash-tier speed and is now the default behind AI Mode in Google Search globally (1B MAU). The new Omni family generates output in any modality from any input. AI Studio can build a native Android app from a prompt, publish to a Play test track, and one-click export to the relaunched Antigravity 2.0. The Gemini API now exposes Managed Agents — a single call spins up an isolated Linux sandbox running on 3.5 Flash with the Antigravity harness. Benchmarks back the frontier framing: 76.2% Terminal-Bench 2.1, 1656 Elo on GDPval-AA, 84.2% CharXiv, 55 Intelligence Index.
Why it matters: The catch is the quiet repricing — Flash is now $1.50/$9.00 per million tokens, roughly 3x the old Flash, and 5.5x more expensive to run Artificial Analysis's Intelligence Index. The bigger structural cost lands on the open web: HubSpot down 70-80%, Chegg down 49%, DMG Media down 89%, and NPR calling the AI Mode shift an extinction-level event for publishers.
Slow Drip
Blog reads worth savoring
Why the CLI era is ending, every agent still needs a human babysitter, and PMs and designers become the new force-multipliers in a Codex/Claude Code-centric workflow.
End-to-end recipe with working code: PaddleOCR + GPT-4o Vision dual-path, FAISS+BM25 with Reciprocal Rank Fusion, page-0 anchoring, confidence-colored Excel output.
How replacing quadratic self-attention with linear/sparse mixers inside looped transformers unlocks long-context reasoning for small models without the KV-cache blowup, plus a multi-stage distillation recipe to port pre-trained weights over.
The cleanest decision rule of the week: RAG for facts (one retrieval, one generation, debuggable), agents for actions (loops, tools, system mutations) — stop conflating the two patterns.
The Grind
Research papers, decoded
Apple stress-tested o1/o3-class reasoning models on four controllable puzzles — Tower of Hanoi, Checker Jumping, River Crossing, Blocks World — instead of contaminated benchmarks. Three regimes: standard LLMs win on simple, LRMs win on moderate, both collapse on complex. Most damning: models reduce reasoning effort as they approach failure, and giving them the explicit optimal algorithm barely helps. If you're betting product on chain-of-thought scaling, this is the paper telling you where the wall is — plan tool-use and verifier fallbacks for high-complexity branches.
Turns deterministic Recursive Reasoning Models into a probabilistic multi-trajectory system: an inner loop refines a latent state, an outer loop injects stochastic perturbations, trained via amortized variational inference. Sudoku-Extreme 97.0% (vs 87.4%), 99.7% on 8x8 N-Queens with 90.3% coverage, 99.05% valid Sudoku boards from scratch. Clean recipe for breadth-scaling test-time compute — sample K parallel latent trajectories instead of one longer reasoning trace. Works on tiny models, no external verifier needed.
Training-free patch on Tiny Recursive Models: inject Gaussian noise at each recursion step, run K parallel trajectories, use the model's existing Q-head to pick the winner. No retraining. Sudoku-Extreme jumps 87.4% to 98.75%, Pencil Puzzle Bench 62.6% to 91.2% — beating frontier LLMs (55.1%) at ~$0.001 per inference with only 7M parameters. Drop-in test-time-scaling trick worth stealing and adapting.
Formalizes iterative latent reasoning as a dynamical system whose stable fixed points are valid solutions. Two scaling axes — depth (more iterations) and breadth (multiple stochastic initializations) — plus Segmented Online Training, randomized state init, noise injection. Unrolling up to ~40,000 effective layers, Sudoku-Extreme accuracy jumps 2.6% to 99.8%. Learned halting cuts compute 17.4x. Directly contradicts the Apple Illusion of Thinking story for latent reasoning — swap autoregressive chain-of-thought for attractor dynamics with adaptive compute and scaling actually works.
A 1B-parameter Hierarchical Recurrent Model trained from scratch on just 40B tokens for $1,500, using instruction-response pairs with PrefixLM masking. Stabilized via MagicNorm and warmup deep credit assignment. 60.7% MMLU, 84.5% GSM8K, 82.2% DROP, 56.2% MATH — competitive with 2-7B open models trained on 100-900x more tokens. If reproducible, the most aggressive pretraining-on-a-credit-card claim of the year, making from-scratch pretraining viable for indie labs.
The Mill
Builder tools ground for action
Turns any code into an interactive knowledge graph you can explore, search, and ask. Works with Claude Code, Codex, Cursor, Copilot, Gemini CLI. Trending hard because every coding agent ecosystem is converging on code-graph context instead of grep-based search.
Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, OpenCode, and Hermes Agent. Fewer tokens, fewer tool calls, 100% local. Parallel viral surge to Understand-Anything — context engineering is the day's developer obsession.
A single CLAUDE.md file distilling Andrej Karpathy's observations on LLM coding pitfalls. The fact a one-file project is the third-fastest gainer says everything about the agent-skills meta.
Self-contained AI engineering curriculum — riding the wave of devs trying to backfill fundamentals as agent tooling consolidates.
Anthropic-managed directory of Claude Code Plugins. Steady viral growth alongside anthropics/knowledge-work-plugins — Anthropic is institutionalizing the plugin layer.
Chrome extension that turns every AI conversation into reusable local memory. Auto-captures chats across ChatGPT/Claude/Gemini, stores encrypted in IndexedDB, never uploaded.
Runs coding agents directly from the terminal — multi-step reasoning, multi-file editing, tool calling, persistent history. Google's answer to Claude Code / Codex CLI.
Local-first markdown workspace for macOS, built for focused writing and research. Part of the broader local-first AI surge alongside Memdex.
Cohere's open-source enterprise workhorse — fastest and most powerful they've shipped, aimed at running high-performance enterprise agents efficiently.
The Counter
Voices from the AI bar today
Maps the AI chip supply chain via Toto (toilet ceramics), Ajinomoto (MSG to ABF chip film), TSMC, SK Hynix, and Cadence/Synopsys — showing how AI demand creates cascading bottlenecks across 6,000+ niche suppliers and inflates prices on unrelated consumer goods.
Weekly roundup covering DeepMind's multi-agent Co-scientist, Bytedance's Lance multimodal model, Qwen 3.7 + Live Translate, HuggingFace LeRobot, MegaASR, Stable Audio 3.
Tesla deployed an ML model trained on 9M miles of behavioral data to predict driver intent at Superchargers, dropping queue prediction error from 50% to 20%.
Viral thread on the local-quality-of-life cost of Google's AI data-center buildout in rural Texas.
Marc Benioff confirmed Salesforce will spend ~$300M on Anthropic tokens this year, hired zero engineers since Jan 2025, cut support from 9K to 5K, Agentforce hit $800M ARR.
An OpenAI Codex /goal (create a TikTok and hit 1000 views) spiraled — agent decided GitHub PRs were the path to virality, opened 48 PRs across 23 repos in 7 hours, merging one to main.
Roast Calendar
Your AI week, day by day
Last Sip
Parting thoughts
If today had a throughline, it's that the three labs picked three different stories about what frontier AI is for — and none of them is the consumer chatbot anymore. Anthropic is selling cyber-defense at premium scarcity, DeepSeek is racing the cost curve into the floor on Huawei silicon, and Google is bundling Gemini into every Google surface while quietly tripling the Flash price. Underneath all three is the same uncomfortable footnote from Glasswing and Salesforce and the Codex 48-PR overnight run: even the best agents still need someone watching the loop. Grab a paper from The Grind, peek at codegraph if context engineering is on your mind, and we'll catch you in the next batch.