DeepSeek V4 Preview launch: 1M-context MoE on Huawei Ascend


Strategic Overview

  • 01.
    DeepSeek released V4 Preview on April 24, 2026 as two open-weight Mixture-of-Experts models: V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B / 13B active), both with a 1-million-token context window.
  • 02.
    A new Hybrid Attention Architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) cuts inference FLOPs to 27% and KV cache to 10% of V3.2 at 1M-token context.
  • 03.
    Huawei announced full Ascend NPU support and Cambricon delivered Day-0 adaptation with open-sourced integration code, marking the first DeepSeek flagship validated on Chinese accelerators at launch.
  • 04.
    Both models ship under the MIT license on Hugging Face and ModelScope, with API endpoints supporting OpenAI ChatCompletions and Anthropic-compatible formats; legacy deepseek-chat and deepseek-reasoner endpoints retire on July 24, 2026.
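Since the new endpoints speak the OpenAI ChatCompletions format (point 04 above), migrating off the retiring deepseek-chat and deepseek-reasoner endpoints mostly means repointing a base URL. A minimal sketch of the request shape follows; the base URL and model identifier here are assumptions for illustration, not confirmed values — check DeepSeek's API docs for the real ones.

```python
import json

BASE_URL = "https://api.deepseek.com"   # assumed base URL
MODEL = "deepseek-v4-flash"             # hypothetical model identifier

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-ChatCompletions-style request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }

payload = build_chat_request("Summarize this repository's build system.")
body = json.dumps(payload)
# POST body to {BASE_URL}/chat/completions with an
# "Authorization: Bearer <key>" header; the response follows the standard
# ChatCompletions schema (choices[0].message.content).
```

Because the wire format is the standard one, existing OpenAI-SDK or Anthropic-SDK client code should need only the base URL and model name swapped.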

The Real Headline Isn't the Model — It's the Stack

Strip away the benchmark numbers and what V4 actually proves is structural: for the first time, a frontier-class open-weight model launched with same-day inference support on Huawei Ascend NPUs and Cambricon accelerators, with the integration code open-sourced to GitHub. That is a different category of event than 'another Chinese model release'. R1 in January 2025 made Silicon Valley nervous about Chinese training efficiency; V4 makes Silicon Valley nervous about something harder to fix — a Chinese end-to-end stack from chip to compiler to model that ships on the same calendar day as the model itself.

This is the scenario Jensen Huang named out loud: 'the day that DeepSeek comes out on Huawei first, that is a horrible outcome' for the United States. April 24 was not literally that day — Liu Zhiyuan of Tsinghua notes V4's main pre-training was likely still done on NVIDIA hardware, with only parts of the training adapted to Chinese chips. But the inference path, where the dollars actually live for a deployed model, is now demonstrably feasible on non-NVIDIA silicon at flagship quality. Huatai Securities' brokerage desk read it as a direct catalyst for domestic-chip adoption in 2026; SMIC's roughly 10% stock jump the same day suggests the equity market agrees. The geopolitical signal is the lead, not the lagging indicator.

Where the 85% Price Cut Actually Comes From

V4-Pro output tokens cost $3.48 per million versus $30 for GPT-5.5 and $25 for Claude Opus 4.6 — roughly an 88% discount at the output side.

The pricing — V4-Pro at $1.74 / $3.48 per million input/output tokens, V4-Flash at $0.14 / $0.28, versus GPT-5.5 at roughly $5 / $30 — looks like a marketing decision but is actually an architectural one. The new Hybrid Attention combines Compressed Sparse Attention (CSA), which lets the model only attend to the small fraction of past tokens that matter for the current step, with Heavily Compressed Attention (HCA), which stores those tokens in a much smaller per-token memory footprint. The result, at a 1M-token context, is roughly 27% of V3.2's inference FLOPs and just 10% of its KV cache (the per-token memory the GPU has to hold while generating). V4-Flash is even more aggressive: 10% of FLOPs and 7% of KV cache. The Register's measurement frames this as a 9.5x to 13.7x reduction in memory needed to serve million-token contexts.

In practical terms, that is what makes the cents-per-million numbers physically possible. A 1M-token agent run — paste an entire codebase, ask the model to find a bug, watch it reason across the whole repo — was uneconomic at V3.2 efficiency on any provider's hardware. It is now roughly an order of magnitude cheaper to serve, which is why the Mixture-of-Experts choice (only 49B of 1.6T parameters fire on any given token for V4-Pro, only 13B of 284B for V4-Flash) compounds with the attention redesign rather than competing with it. The 73% headline 'cost reduction vs. prior generation' is the accounting consequence of these two design choices stacked. For closed-source incumbents, the uncomfortable read is that DeepSeek isn't winning on subsidy — it's winning on the inference math.
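The arithmetic behind those claims is easy to check against the numbers quoted above. A back-of-envelope sketch (prices as stated in this piece; the 20K-token output size in the example is an arbitrary assumption for a long agent run):

```python
# USD per million tokens: (input, output), as quoted above.
PRICES = {
    "v4-pro":   (1.74, 3.48),
    "v4-flash": (0.14, 0.28),
    "gpt-5.5":  (5.00, 30.00),
}

def output_discount(model: str, vs: str) -> float:
    """Fraction saved on output tokens relative to a competitor."""
    return 1 - PRICES[model][1] / PRICES[vs][1]

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single run at the listed per-million rates."""
    inp, out = PRICES[model]
    return inp * input_tokens / 1e6 + out * output_tokens / 1e6

# ~88% cheaper than GPT-5.5 on output tokens, matching the pull-quote figure.
print(f"{output_discount('v4-pro', 'gpt-5.5'):.0%}")    # 88%

# A paste-the-whole-codebase run: 1M input tokens, 20K output tokens.
print(f"${run_cost('v4-pro', 1_000_000, 20_000):.2f}")  # $1.81
```

The same 1M-token run priced at GPT-5.5's listed rates comes to $5.60, before even counting the serving-side KV-cache savings that make the DeepSeek rates sustainable in the first place.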

The Receipt Inside the Paper

Buried in §5.4.4 of the V4 technical report — and the line the r/LocalLLaMA launch thread fixated on — is a quietly extraordinary internal-adoption claim: 52% of DeepSeek's own engineers say V4-Pro is now their primary coding model, with another 39% leaning that way. That is the kind of dogfooding receipt Anthropic and OpenAI tend not to publish, and it does more credibility work than any benchmark line on MMLU-Pro or LiveCodeBench. Independently, Artificial Analysis ranked V4-Pro #1 among open-weight models on its GDPval-AA agentic real-world evaluation and #2 overall on its Intelligence Index behind only Kimi K2.6, and SGLang shipped Day-0 inference optimization. None of that lives in the same epistemic register as a self-reported MMLU score.

The community reaction split immediately along this fault line. The viral LocalLLaMA 'Deepseek v4 people' thread spent its top comments dismissing V4's correct answer to a classic 'winter tires' reasoning gotcha as memorization rather than reasoning — 'It's in the data at this point' — a healthy reminder that contamination skepticism scales with a model's perceived strength. On r/singularity the dominant frame was simpler: DeepSeek is 'smelling blood in the water' the same week OpenAI doubled GPT-5.5 prices. AnythingLLM's Tim Carambat, in YouTube commentary, argued that the CSA/HCA attention design is a bigger story than any of the benchmark scoreboards — which, given the price math above, is the read that ages best.

Why the Market Yawn Is the Bullish Signal

The most interesting tell is what didn't happen. In January 2025, R1 and V3 took roughly $590B off NVIDIA's market cap in a single news cycle. V4 — a more capable model, on Chinese silicon, at lower prices — produced a far more contained reaction concentrated in Chinese names: SMIC up around 10% in Hong Kong, MiniMax and Knowledge Atlas down more than 9% as investors rotated out of competing Chinese labs presumed to be on the wrong side of the price collapse. NVIDIA did not crater this time.

Two readings. The bearish one for NVIDIA: the market has already priced in a world where Chinese AI runs on Chinese chips at the margin, so the V4 confirmation is news but not surprise. The bullish one for DeepSeek: the launch is being absorbed as structural rather than episodic — not a one-off shock but the next datapoint in a now-established trend line. White House Science & Technology Advisor Michael Kratsios responded with the standard framing that 'there is nothing innovative about systematically extracting and copying the innovations of American industry,' which is the kind of statement governments make when capital markets have already moved on. Either reading is harder for closed-source flagship API providers than for chipmakers, which is also where the immediate competitive pressure now sits — on token pricing, not on training compute.

The Caveats the Hype Cycle Will Skip

Three things worth holding in tension with the headline. First, the caveat from Tsinghua's Liu Zhiyuan: V4's main pre-training was likely still done on NVIDIA hardware, so the Chinese-stack story is real for inference but only partial for training — and the gap matters, because training is where the export-control bite is sharpest. 'Day-0 Huawei support' is a deployment claim, not a training-independence claim, and the two will keep getting conflated in coverage.

Second, on local hosting: the LocalLLaMA thread reality-checked the implied accessibility. V4-Flash needs 256GB+ of RAM despite only 13B active parameters, because Mixture-of-Experts models still have to load all expert weights into memory even when most don't fire per token. 'Open weights' is not the same as 'runs on your gaming rig.' Third, the contamination skepticism on Reddit isn't paranoid — when a model nails every classic reasoning gotcha, the prior on memorization rises mechanically, and the more interesting public evals (Artificial Analysis GDPval-AA, where V4-Pro ranked first among open weights for agentic real-world tasks) are doing more truth-telling work than any single MMLU-Pro number. The launch is genuinely consequential; it is also the moment when the next twelve months of independent replication and adversarial probing actually start.
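The 256GB figure falls out of simple arithmetic on the numbers above. A sketch, assuming the FP4/FP8 mixed precision mentioned in the timeline (the exact per-layer precision split is not public, so both endpoints are shown):

```python
# Why "13B active" still needs 256GB+ of RAM: every expert's weights must be
# resident in memory, even though only a few experts fire on any given token.
def weight_footprint_gb(total_params_billions: float, bytes_per_param: float) -> float:
    """Raw weight storage in GB, ignoring KV cache and runtime overhead."""
    return total_params_billions * bytes_per_param

TOTAL_B = 284   # V4-Flash total parameters, in billions
ACTIVE_B = 13   # parameters that fire per token (drives compute, not memory)

print(weight_footprint_gb(TOTAL_B, 1.0))   # all-FP8 bound: 284 GB
print(weight_footprint_gb(TOTAL_B, 0.5))   # all-FP4 bound: 142 GB
# A mixed FP4/FP8 checkpoint lands between these bounds, which is consistent
# with the 256GB+ figure from the thread once KV cache and runtime overhead
# are added. Per-token compute, by contrast, scales with the 13B active
# parameters -- that is where the MoE speedup lives.
```

The asymmetry is the whole point of the caveat: MoE trades memory (all experts resident) for compute (few experts active), so 'open weights' buys you cheap tokens only if you already have server-class RAM.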

Historical Context

2025-01
DeepSeek's R1 and V3 releases triggered an estimated $590B drawdown in NVIDIA's market capitalization and became Silicon Valley's first serious shock over Chinese model efficiency.
2025-09
DeepSeek-V3.2 introduced DeepSeek Sparse Attention as the precursor to V4's CSA/HCA hybrid; the V3 line previously capped context at 128K tokens.
2026-04-24
DeepSeek V4 Preview (Pro and Flash) launched on Hugging Face with 1M-token context, FP4/FP8 mixed precision, MIT license, and Day-0 Huawei Ascend and Cambricon support.
2026-07-24
Legacy endpoints deepseek-chat and deepseek-reasoner are scheduled for retirement, completing the API migration to the V4 generation.

Power Map

Key Players

DE

DeepSeek (founder Liang Wenfeng)

Released V4 Preview as a flagship open-weight challenger to OpenAI and Anthropic, reportedly closing a funding round at roughly a $20 billion valuation alongside the launch.

HU

Huawei

Provided full Ascend NPU and supernode support for V4 inference on Day 0, positioning Ascend silicon as a credible substitute for NVIDIA in Chinese AI deployment.

CA

Cambricon Technologies

Co-engineered hardware compatibility with V4, providing Day-0 adaptation and open-sourcing the integration code on GitHub.

NV

NVIDIA

Faces strategic risk as V4 becomes the first frontier-class model effectively co-launched on Chinese silicon; Jensen Huang has publicly named exactly this scenario as the bad-case outcome for the U.S.

OP

OpenAI and Anthropic

Direct pricing competitors; V4-Pro undercuts GPT-5.5 and Claude Opus 4.6 output-token pricing by roughly 85%, intensifying margin pressure on closed-source flagship APIs.

SM

SMIC, MiniMax and Knowledge Atlas

Capital-markets bellwethers for the V4 thesis: SMIC jumped roughly 10% in Hong Kong on the read-through to domestic chip demand, while competing Chinese AI labs MiniMax and Knowledge Atlas fell more than 9%.

Source Articles


Analysts

"Has warned publicly that a Chinese frontier model first appearing on Huawei silicon would be the strategically worst-case outcome for the United States — the exact scenario V4 now approximates."

Jensen Huang
CEO, NVIDIA

"Reads V4-Pro's benchmark profile as 'excellent agent capability at significantly lower cost' than US closed-source incumbents, framing the launch as primarily an agent-economics event."

Wei Sun
Principal AI Analyst, Counterpoint Research

"Cautions that V4's main pre-training run was likely still done on NVIDIA hardware, with only partial training adapted to Chinese chips so far — tempering the 'NVIDIA-free' headline."

Liu Zhiyuan
Computer Science Professor, Tsinghua University

"Calls V4 an explicit catalyst for adoption of domestic Chinese accelerators in 2026, expecting 'a significant improvement in the capabilities of domestic graphics cards and their widespread adoption this year'."

Huatai Securities
Brokerage research desk

"Pushed back on Chinese AI advances as derivative, saying 'there is nothing innovative about systematically extracting and copying the innovations of American industry' — the prevailing US policy framing of the launch."

Michael Kratsios
White House Science & Technology Advisor
The Crowd

"DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. DeepSeek-V4-Flash: 284B total / 13B active params."

@deepseek_ai

"DeepSeek V4 Pro is the #1 open weights model on GDPval-AA, our agentic real-world work tasks evaluation. @deepseek_ai has released V4 Pro (1.6T total / 49B active) and V4 Flash (284B total / 13B active). V4 is DeepSeek's first new size since V3."

@ArtificialAnlys

"DeepSeek V4 by @deepseek_ai just dropped! SGLang is ready on Day 0 with a full stack of optimizations from architectures to low-level kernels. We also deliver a verified RL training pipeline in Miles (by @radixark) for V4 at launch."

@lmsysorg

"Deepseek v4 people"

u/markeus101207
Broadcast
GPT 5.5 Arrives, DeepSeek V4 Drops, and the Compute War Intensifies

DeepSeek V4 Is HERE - Testing the LARGEST Open Source Model Ever!

DeepSeek V4 just shocked the AI industry...
