NVIDIA SANA-WM Open-Source World Model

Strategic Overview

01. SANA-WM is a 2.6B-parameter open-source world model from NVIDIA that generates one-minute 720p videos with 6-DoF camera control from a single image, text prompt, and camera trajectory.
02. The architecture uses Hybrid Linear Attention (frame-wise Gated DeltaNet plus selective softmax attention), a Dual-Branch Camera Control module, a two-stage pipeline with a long-video refiner, and a metric-scale 6-DoF pose annotation pipeline.
03. Full training used 64 H100s for 15 days on roughly 213K public video clips; a distilled variant with NVFP4 quantization denoises a 60-second 720p clip in 34 seconds on a single RTX 5090.
04. On the one-minute world-model benchmark SANA-WM reports comparable visual quality to closed industrial baselines (LingBot-World, HY-WorldPlay) at 36x higher throughput, with stronger action-following accuracy than prior open-source baselines.

The 2.6B Model That Matches Eight-GPU Industrial Baselines

The headline number from the SANA-WM paper ^[1] is not the resolution and not the minute-long output; it is the parameter count. A 2.6B model trained on roughly 213K public video clips with 64 H100s over 15 days reports comparable visual quality to closed industrial baselines like LingBot-World and HY-WorldPlay at 36x higher throughput ^[2]. Until this release, single-GPU minute-scale world modeling at 720p with precise 6-DoF camera control was essentially out of reach for open-source work. Community reaction emphasized the contrast with prior open-source baselines, which typically required eight-GPU inference or dropped resolution to 480p to be tractable.

The practical effect is a shift in who can participate. The distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization ^[1]. That price point — one consumer-grade card, half a minute of compute per clip — pulls a class of controllable video generation experiments into the budget of an individual robotics graduate student. The Coders Blog frames the same point as 'world modeling at the edge of real-time' ^[3], which is the right framing: this is not just a smaller model, it is a model whose inference cost finally clears the bar where researchers can iterate.
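The compute figures quoted above can be turned into two derived numbers with plain arithmetic. The sketch below is only that: the four constants come from the article, the function names are illustrative, and the real-time factor covers the quoted denoising time alone, since the article gives no figure for any other stage.

```python
# Figures quoted in the article; everything else here is derived arithmetic.
NUM_GPUS = 64          # H100s used for full training
TRAINING_DAYS = 15     # wall-clock training duration
CLIP_SECONDS = 60      # length of the generated clip
DENOISE_SECONDS = 34   # quoted denoising time on one RTX 5090 (distilled, NVFP4)


def training_gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours, assuming all GPUs ran for the full duration."""
    return num_gpus * days * 24


def realtime_factor(clip_seconds: float, compute_seconds: float) -> float:
    """Seconds of video produced per second of compute (above 1.0 is faster than playback)."""
    return clip_seconds / compute_seconds


if __name__ == "__main__":
    print(f"training budget: {training_gpu_hours(NUM_GPUS, TRAINING_DAYS):,.0f} H100-hours")
    print(f"denoising real-time factor: {realtime_factor(CLIP_SECONDS, DENOISE_SECONDS):.2f}x")
```

A factor above 1.0 for the denoising step is what makes the "edge of real-time" framing plausible; end-to-end latency would need to be measured separately.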

Hybrid Linear Attention and the D-by-D Trick That Makes Minutes Possible

Under the hood, the core architectural decision is Hybrid Linear Attention, which pairs frame-wise Gated DeltaNet with selective softmax attention ^[1]. This matters for minute-scale video because of cost: full softmax attention over a minute of 720p frames scales quadratically with sequence length, which is exactly why prior open-source models capped out at short clips or low resolution. Gated DeltaNet instead keeps a recurrent hidden state of constant D-by-D size no matter how long the video runs, and softmax attention is reserved for the places where global reasoning is actually needed. That asymmetric split is the engineering trick that lets a 2.6B model do on a single GPU what previously called for eight-GPU inference or a drop to 480p.
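To make the constant-size state concrete, here is a minimal NumPy sketch of a generic gated delta rule recurrence, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, as published for Gated DeltaNet in the language-modeling setting. It is not SANA-WM's frame-wise implementation, which the article does not detail; the dimension, the gate ranges, and the random inputs are all assumptions chosen for illustration.

```python
import numpy as np


def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent update of a generic gated delta rule (illustrative, not SANA-WM code).

    S is a (D, D) state, k is a unit-norm key, alpha in (0, 1] is the decay gate,
    beta in [0, 1] is the write strength.
    """
    # Decay the old state, erase what is stored under key k, then write v there.
    # Same as alpha * S @ (I - beta * k k^T) + beta * v k^T, without forming I.
    S_new = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S_new, S_new @ q


def run_sequence(num_tokens, dim, seed=0):
    """Feed random tokens through the recurrence and return the final state."""
    rng = np.random.default_rng(seed)
    S = np.zeros((dim, dim))
    for _ in range(num_tokens):
        q, k, v = rng.standard_normal((3, dim))
        k = k / np.linalg.norm(k)          # unit-norm key
        alpha = rng.uniform(0.9, 1.0)      # hypothetical decay-gate range
        beta = rng.uniform(0.0, 1.0)       # hypothetical write-strength range
        S, _ = gated_delta_step(S, q, k, v, alpha, beta)
    return S


def softmax_score_entries(num_tokens):
    """Entries in a full attention score matrix for one head: quadratic in length."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        state = run_sequence(n, dim=64)
        print(n, state.shape, softmax_score_entries(n))
```

The state stays at `(64, 64)` for every sequence length, while the softmax score count grows with the square of the length. That contrast, not the specific numbers, is the point of the sketch.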

A practitioner-oriented strand of the community response flagged two concrete risks that come with this design choice: the stability of the Gated DeltaNet decay gate over very long sequences, and possible CamMC degradation along the distilled inference path. Those are exactly the failure modes you would expect from linear-attention compromises: when the recurrent state forgets too aggressively, camera-trajectory accuracy is the first metric to drift. The paper reports RotErr of 4.50° / 8.34° and CamMC of 1.41 / 1.44 on its camera-control benchmarks ^[2]. Those numbers suggest the trade-off is currently well managed, but careful evaluators should still re-measure on their own trajectories before trusting the model on novel control patterns.
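For readers who want to re-measure rotation error on their own trajectories, a common definition is the geodesic angle between a predicted and a reference rotation matrix. The sketch below uses that standard formula; the article does not describe the paper's exact RotErr protocol (how poses are recovered from generated frames, or whether errors are averaged or accumulated), so the function names and the per-frame averaging here are assumptions.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_gt.T @ R_pred
    # trace(R) = 1 + 2*cos(theta) for a rotation by theta; clip guards against rounding.
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))


def mean_rotation_error_deg(preds, gts):
    """Average per-frame rotation error over a trajectory (assumed aggregation)."""
    return float(np.mean([rotation_error_deg(p, g) for p, g in zip(preds, gts)]))


def rot_z(deg):
    """Rotation about the z axis, used only to build toy inputs."""
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])


if __name__ == "__main__":
    gt = [rot_z(a) for a in (0, 10, 20, 30)]
    pred = [rot_z(a + 5) for a in (0, 10, 20, 30)]   # constant 5-degree yaw offset
    print(f"mean RotErr: {mean_rotation_error_deg(pred, gt):.2f} deg")
```

In practice the predicted rotations would come from running a pose estimator over the generated video, which adds its own error and is outside this sketch.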

Is It Really a World Model, or a Very Long Camera-Controlled Video?

The most interesting tension in the response is semantic. SANA-WM is marketed as a world model, but the artifacts community reviewers found in the launch demos are not the kind a true persistent-physics simulator would produce. One Hacker News reader pointed out that a book on a library table 'takes up different shapes every now and then' ^[4], while another flagged a snow-and-cave-entrance video where geometry shifts as the camera moves around it ^[4]. Skeptical voices in the broader community pushed further, openly questioning whether the 'world model' label oversells what is fundamentally a controllable video generator over a longer time horizon. That read lined up with The Coders Blog's worry ^[3] about how SANA-WM would handle unforeseen events and occlusions in a real robotics loop.

This matters because the language matters for deployment. A world model implies persistent state — the book on the table should still be the same book three seconds later, even when the camera leaves and returns. A camera-controlled long-form video generator with strong action-following and 6-DoF trajectory adherence is genuinely useful, but it is a different product. The morphing-object failure mode is exactly what would cause a robot trained on SANA-WM rollouts to act on stale state. For now, the most accurate description is that SANA-WM is the strongest open baseline yet for camera-controlled minute-scale generation, and a useful but not-yet-trusted simulator for embodied training. The signal volume on these critiques is thin because the paper dropped within the prior 24 hours, but the angle is concrete enough to merit pressure-testing before adoption.
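One way to pressure-test the persistent-state question is a revisit check: find frame pairs where the camera returns to nearly the same pose after some time away, then compare the two frames. The sketch below is a hypothetical probe, not a metric from the paper. It assumes 4x4 camera-to-world pose matrices, frames as float arrays in [0, 1], and thresholds picked arbitrarily for illustration.

```python
import numpy as np


def pose_distance(T_a, T_b):
    """Translation distance and rotation angle (degrees) between two 4x4 poses."""
    trans = float(np.linalg.norm(T_a[:3, 3] - T_b[:3, 3]))
    R_rel = T_a[:3, :3].T @ T_b[:3, :3]
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return trans, float(np.degrees(np.arccos(cos_theta)))


def revisit_pairs(poses, min_gap, max_trans, max_rot_deg):
    """Index pairs (i, j), at least min_gap frames apart, with near-identical poses."""
    pairs = []
    for i in range(len(poses)):
        for j in range(i + min_gap, len(poses)):
            trans, rot = pose_distance(poses[i], poses[j])
            if trans <= max_trans and rot <= max_rot_deg:
                pairs.append((i, j))
    return pairs


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    mse = float(np.mean((a - b) ** 2))
    return float("inf") if mse == 0.0 else float(10.0 * np.log10(peak ** 2 / mse))


def revisit_consistency(frames, poses, min_gap=24, max_trans=0.05, max_rot_deg=2.0):
    """PSNR for every revisit pair; low values hint that the scene changed while unobserved."""
    pairs = revisit_pairs(poses, min_gap, max_trans, max_rot_deg)
    return [(i, j, psnr(frames[i], frames[j])) for i, j in pairs]
```

A low score is only a hint. Scenes with legitimately moving content will also differ between visits, so the check is most informative on static scenes such as the library-table example.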

Why SANA-WM Is Really About NVIDIA's Stack, Not Just NVIDIA's Research

Step back from the paper and the release fits a sequence: SANA in October 2024 ^[5], the Cosmos World Foundation Model Platform in January 2025 ^[6], expanded physical-AI models including Cosmos-Predict2.5 and the Isaac GR00T N1.6 humanoid VLA in January 2026 ^[7], and now SANA-WM in May 2026 ^[8]. Each release widens the open-source surface area for physical AI — and each anchors the developer experience to NVIDIA-specific primitives. NVFP4 quantization, which is what makes the single-RTX-5090 inference path work, is NVIDIA-native.
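For a sense of what 4-bit block quantization does to weights, the sketch below simulates an NVFP4-style round trip in NumPy. It rests on assumptions drawn from NVIDIA's public description of the format rather than from the SANA-WM paper: values are stored as 4-bit E2M1 floats (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one scale per block of 16 elements. The real format also stores block scales in FP8 and uses a second per-tensor scale; both are simplified away here, so this is a numerical illustration and not a reimplementation.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # assumed micro-block size


def fake_quant_fp4_blocks(x):
    """Quantize and dequantize a 1-D array block-wise; returns an array shaped like x."""
    x = np.asarray(x, dtype=np.float64)
    if x.ndim != 1 or x.size % BLOCK != 0:
        raise ValueError(f"expected a 1-D array whose length is a multiple of {BLOCK}")
    blocks = x.reshape(-1, BLOCK)
    # One scale per block maps the block's largest magnitude onto the top grid value.
    # Simplification: the scale stays in float64 instead of being rounded to FP8.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    scaled = blocks / scale
    # Round each magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]
    return (quantized * scale).reshape(x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(64)
    w_q = fake_quant_fp4_blocks(w)
    print("max abs error:", float(np.max(np.abs(w - w_q))))
```

Because each block gets its own scale, a single outlier only coarsens the 16 values that share its block, which is the usual argument for small block sizes in low-bit formats.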

The strategic read is that NVIDIA is doing for physical AI what it did a decade ago for deep learning: subsidizing the open primitives so the surrounding tooling, weights, and training recipes assume NVIDIA hardware by default. Releasing SANA-WM as code and weights lowers the bar for academic and startup labs to build on top of it, which in turn means the next generation of robotics models will train and infer on NVIDIA silicon by reflex rather than by choice. One footnote on the 'open' claim: Hacker News users noted that the project-page download buttons were initially marked 'coming soon' on launch day ^[4]. The 36x throughput claim is the surface story; the durable story is which company's GPUs the world's embodied AI workloads end up running on.

Historical Context

2024-10

Original SANA paper released — efficient high-resolution image synthesis using a Linear Diffusion Transformer, establishing the architectural family SANA-WM extends.

2024-12

1.6B 2K-resolution Sana models released; Hugging Face diffusers added SanaPipeline support and LoRA fine-tuning, expanding adoption of the Sana stack.

2025-01

Cosmos World Foundation Model Platform introduced for physical AI, establishing NVIDIA's broader world-model line (cosmos-predict, cosmos-transfer, cosmos-reason).

2026-01

Expanded open physical-AI models and datasets released, including Cosmos-Predict2.5 and the Isaac GR00T N1.6 humanoid VLA model built on Cosmos.

2026-05-15

SANA-WM paper posted to arXiv, extending the Sana family from image synthesis into minute-scale controllable world modeling.

Power Map

Key Players

Subject

NVIDIA SANA-WM Open-Source World Model

NVIDIA / NVlabs

Publisher and developer; SANA-WM extends NVIDIA's existing Sana linear-diffusion family and its Cosmos World Foundation Model platform, with code and resources released through the NVlabs GitHub.

SANA-WM author team (Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie)

Research team behind the paper; senior involvement from Song Han and Enze Xie carries the architectural lineage from the original Sana image-synthesis work into minute-scale video.

Embodied AI and robotics research community

Primary downstream user base; SANA-WM is explicitly positioned as a single-GPU baseline for action-conditioned world modeling and simulation work that previously required industrial compute.

Closed-source industrial world models (LingBot-World, HY-WorldPlay)

Named quality benchmarks; SANA-WM frames its contribution by matching their visual quality at a fraction of the compute, making them the implicit competitive target.

THE SIGNAL.

Analysts

"Surprised at the quality and consistency from a model this size: 'Outputting video of that quality/consistency at 1 minute, for a 2.6B model seems insane?'"

Incipient

Hacker News commenter

"Flags concrete temporal-consistency artifacts in the demo footage: 'First video with the guy walking the mountain in snow has consistency problems with the cave entrance.'"

pferdone

Hacker News commenter

"Notes object morphing within a scene, a known failure mode for long-context video models: 'the book on the table in the library video takes up different shapes every now and then.'"

Leonard_of_Q

Hacker News commenter

"Contextualizes SANA-WM's value versus alternatives like LTX, which struggle with camera movement and subject-object interaction."

bobkb

Hacker News commenter

"Praises the efficiency claims but warns that benchmark performance may not translate to dynamic real-world robotics deployment given trade-offs inherent in linear attention and adaptation speed."

The Coders Blog

Independent technical blog

The Crowd

"NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU"

@u/ai-lover17

"SANA-WM, a 'world model' video model from NVIDIA Labs, announced as open source."

@u/artsnumeriques2

"SANA-WM, a 2.6B open-source world model for 1-minute 720p video"

@u/TheStartupChime1

Broadcast

SANA-WM: Efficient Minute-Scale World Model

SANA-WM: Minute-Scale World Modeling on a Single GPU

SANA-WM: RTX 5090 single-card 34 seconds to generate 1-minute 720p video