NVIDIA Nemotron 3 Nano Omni multimodal model release
TECH

38+ Signals

Strategic Overview

  • 01.
    On April 28, 2026, NVIDIA released Nemotron 3 Nano Omni, an open multimodal model unifying vision, audio, video, and text reasoning in a single inference pass for agentic workloads.
  • 02.
    The model is a 30B total / 3B active Mixture-of-Experts on a hybrid Mamba-Transformer backbone, with 23 Mamba-2 layers, 23 MoE layers (128 experts, top-6 routing plus a shared expert), and 6 grouped-query attention layers.
  • 03.
    Its encoder-projector-decoder design pairs the Nemotron 3 Nano 30B-A3B language backbone with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, supporting up to a 256K-token shared multimodal context.
  • 04.
    NVIDIA claims up to 9x higher throughput and 2.9x faster single-stream multimodal reasoning versus alternatives, with capabilities spanning document analysis, multi-image reasoning, ASR, long audio-video understanding, and agentic computer use.
  • 05.
    Weights ship in BF16, FP8, and NVFP4 under the NVIDIA Open Model Agreement permitting commercial use, and the model is immediately available on Hugging Face, OpenRouter (free), build.nvidia.com as an NIM microservice, and Amazon SageMaker JumpStart.
  • 06.
    Headline early adopters include H Company, Palantir, Foxconn, and Infosys, with Crusoe, Together AI, FriendliAI, GMI Cloud, and Vultr offering managed hosting on or near day zero.

Deep Analysis

The Latency Wall That Just Came Down

The most important number in the Nemotron 3 Nano Omni release is not a parameter count: it is the jump from 11.1 to 47.4 on the OSWorld GUI-agent benchmark, a 4.3x improvement that reframes what computer-use agents can credibly attempt. H Company CEO Gautier Cloix is explicit that this is a categorical change, not an incremental one: 'To build useful agents, you can’t wait seconds for a model to interpret a screen.' Until now, screen-aware agents have been bottlenecked by the round-trip cost of a vision-language model parsing each frame, forcing product teams either to downsample to thumbnails (losing button-level precision) or to accept multi-second latency that breaks the interaction loop.

Omni's claim is that interpreting full-HD screen recordings in real time is finally tractable on a single open model with 3B active parameters. That economic argument matters more than any single benchmark: the cost of running a perception loop scales linearly with active parameters, not total parameters, so a 30B-A3B MoE running at the throughput of a 3B dense model can power thousands of concurrent agent sessions per GPU. If H Company's deployment generalizes, the gating constraint on browser/computer-use agents shifts from raw model capability to scaffolding, memory, and recovery — engineering problems, not research ones.
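
The arithmetic behind that claim is simple. A minimal sketch, assuming the standard rough estimate of ~2 FLOPs per active parameter per generated token; the GPU compute budget below is a hypothetical placeholder, not a measured figure:

```python
# Back-of-envelope decode cost: roughly 2 FLOPs per active parameter per token.
# GPU_FLOPS_BUDGET is a hypothetical placeholder, not a measured number.
ACTIVE_PARAMS_MOE = 3e9      # Nemotron 3 Nano Omni: 3B active of 30B total
ACTIVE_PARAMS_DENSE = 30e9   # a dense 30B model activates every parameter
GPU_FLOPS_BUDGET = 1e15      # assumed sustained FLOP/s for one accelerator

def tokens_per_second(active_params: float, flops_budget: float) -> float:
    """Approximate decode throughput: FLOPs budget / (2 * active params)."""
    return flops_budget / (2 * active_params)

print(f"MoE   (3B active):  {tokens_per_second(ACTIVE_PARAMS_MOE, GPU_FLOPS_BUDGET):,.0f} tok/s")
print(f"Dense (30B active): {tokens_per_second(ACTIVE_PARAMS_DENSE, GPU_FLOPS_BUDGET):,.0f} tok/s")
# The ratio is 10x regardless of the budget chosen: per-token compute scales
# with active parameters, not total parameters.
```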

Why a Hybrid Mamba-Transformer MoE Is the Right Shape for Multimodal

Most multimodal models pay a quadratic attention tax on every video frame and audio chunk they ingest, which is why long-context video reasoning has been the weakest link in the open-model stack. Omni's architecture is an explicit answer to that constraint. The backbone interleaves 23 Mamba-2 selective state-space layers — which process long sequences in linear time and constant memory — with 23 MoE layers that route tokens to 6 of 128 experts plus a shared expert, and just 6 grouped-query attention layers placed where global token-to-token interaction is actually needed. The NVIDIA technical team frames the trade explicitly: Mamba layers handle 'sequence and memory efficiency,' transformer layers handle 'precise reasoning,' and MoE provides conditional capacity without inflating the active-parameter cost.
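
To make the routing concrete, here is a minimal NumPy sketch of top-6-of-128 expert routing with an always-on shared expert, matching the layer description above; the dimensions, gating network, and expert weights are illustrative stand-ins, not NVIDIA's implementation:

```python
import numpy as np

# Illustrative MoE routing sketch: 6 of 128 experts per token plus a shared
# expert. All weights and dimensions here are made-up stand-ins.
rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 128, 6, 64

gate_w = rng.standard_normal((D, NUM_EXPERTS)) * 0.02       # router weights
experts = rng.standard_normal((NUM_EXPERTS, D, D)) * 0.02   # per-expert FFN stand-ins
shared = rng.standard_normal((D, D)) * 0.02                 # always-on shared expert

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ gate_w                  # (128,) router scores
    top = np.argsort(logits)[-TOP_K:]        # indices of the 6 winning experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax over the selected experts
    routed = sum(wi * (token @ experts[i]) for wi, i in zip(w, top))
    return routed + token @ shared           # shared expert sees every token

out = moe_layer(rng.standard_normal(D))
# Only 6 of 128 expert FFNs run per token, which is how 30B total parameters
# can serve at roughly the cost of 3B active ones.
```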

The payoff is visible in the system metrics, not just the accuracy scores. NVIDIA reports 7.4x higher system efficiency for multi-document use cases and 9.2x for video, numbers that only make sense in an architecture where most layers do not pay attention's quadratic cost. The encoder-projector-decoder design (C-RADIOv4-H for vision, Parakeet-TDT-0.6B-v2 for audio, projected into the Nemotron 3 Nano 30B-A3B language backbone) lets the 256K-token context be a genuinely shared multimodal buffer rather than a token budget that vision and audio compete for. This is the design choice that makes 'one inference pass for vision, audio, video, and text' a real engineering claim instead of marketing collage.
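
The shared-buffer idea reduces to a few lines: each encoder's features are linearly projected to the language model's embedding width and concatenated with text embeddings into one sequence. Every shape and projector below is a hypothetical illustration of the encoder-projector-decoder pattern, not the released code:

```python
import numpy as np

# Hypothetical encoder-projector-decoder sketch: vision and audio features are
# projected into the language embedding width and concatenated with text tokens
# into a single sequence. All dimensions are illustrative, not the real model's.
rng = np.random.default_rng(1)
D_MODEL = 2048                        # assumed language-backbone embedding width
D_VISION, D_AUDIO = 1280, 1024        # assumed encoder output widths

proj_vision = rng.standard_normal((D_VISION, D_MODEL)) * 0.02
proj_audio = rng.standard_normal((D_AUDIO, D_MODEL)) * 0.02

vision_feats = rng.standard_normal((196, D_VISION))   # e.g. one image's patches
audio_feats = rng.standard_normal((300, D_AUDIO))     # e.g. a few seconds of audio
text_embeds = rng.standard_normal((50, D_MODEL))      # embedded text tokens

# One shared sequence: every modality spends from the same 256K token budget.
sequence = np.concatenate(
    [vision_feats @ proj_vision, audio_feats @ proj_audio, text_embeds], axis=0
)
print(sequence.shape)  # (546, 2048): one buffer, one decoder pass
```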

Open Beats Closed on Multimodal Throughput — and the Receipts Are Independent

[Chart] MediaPerf throughput, in hours of media processed per hour of compute: Nemotron 3 Nano Omni leads on both tagging and summarization, beating GPT 5.1 by 4-5x and Gemini 3.0 Pro by 4-6x.

Vendor throughput claims are easy to dismiss, which is why the Coactive MediaPerf numbers are the most important external validation in this launch. On the tagging task, Omni hits 9.91 hours of video processed per hour of compute — 5x GPT 5.1, 6x Gemini 3.0 Pro, 2.6x Qwen3-VL. On summarization, it hits 10.79 h/h — 4x both frontier closed models and 2.7x the leading open competitor. These are independent benchmarks against the actual closed-source incumbents, not cherry-picked subsets, and they invert the usual narrative that closed labs hold a quality-per-dollar lead in multimodal.
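
As a sanity check, the multipliers pin down the implied baseline throughputs; a quick back-of-envelope using only the figures quoted above:

```python
# Implied baseline throughputs from the MediaPerf figures quoted above
# (hours of media processed per hour of compute).
tagging_omni, summar_omni = 9.91, 10.79

print(f"Tagging  GPT 5.1:        ~{tagging_omni / 5:.2f} h/h")    # ~1.98
print(f"Tagging  Gemini 3.0 Pro: ~{tagging_omni / 6:.2f} h/h")    # ~1.65
print(f"Tagging  Qwen3-VL:       ~{tagging_omni / 2.6:.2f} h/h")  # ~3.81
print(f"Summar.  GPT/Gemini:     ~{summar_omni / 4:.2f} h/h")     # ~2.70
print(f"Summar.  Qwen3-VL:       ~{summar_omni / 2.7:.2f} h/h")   # ~4.00
# The closed incumbents land around 2-2.7 hours of media per compute-hour,
# versus roughly 10 for Omni.
```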

The strategic implication is that NVIDIA has just made the throughput-per-dollar argument for closed multimodal models structurally weaker. A media-intelligence team building tagging pipelines, a compliance team summarizing video evidence, an OCR pipeline scaling to millions of documents — these workloads are throughput-bound, and Omni at OpenRouter's free tier or self-hosted on Hugging Face weights now sits 4-6x ahead of GPT 5.1 and Gemini 3.0 Pro on the metric they actually care about. AWS naming the same use cases — Q&A, summarization, transcription, OCR, document intelligence — in its SageMaker JumpStart positioning suggests the cloud channel agrees: this is not the model you reach for to write code, it is the model you reach for to read the world at scale.

The Synthetic-Data Disclosure Is the Quiet Story

Buried in the Hugging Face model card and surfaced by The Decoder is something genuinely unusual: NVIDIA disclosed that Omni's training pipeline drew on outputs from Qwen3-VL-30B, gpt-oss-120b, Kimi-K2.5, DeepSeek-OCR, GPT-4o, and Gemini 3 Flash Preview, alongside roughly 11.4M synthetic VQA pairs (~45B tokens) and 2.3M+ RL rollouts across 25 environment configurations. Most labs treat their synthetic-data sources as the crown jewels and either decline to name them or strongly imply everything was generated in-house. NVIDIA is publishing the recipe.

That candor is partly possible because NVIDIA is not a foundation-model competitor in the OpenAI/Google sense — it sells the platform underneath all of these models, so distilling from the field is consistent with its commercial position rather than awkward for it. But the disclosure also functions as a reproducibility signal to the open ecosystem: it tells research teams which teacher models actually worked, which reduces the wasted compute spent rediscovering this. Combined with the 25-trillion-token pretraining corpus, ~127B tokens of mixed-modality data, and ~124M curated post-training examples, the Omni training narrative is one of the most fully specified for any frontier-class open multimodal model — and that documentation is itself a competitive moat against closed labs that cannot match it without breaking their own secrecy norms.
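
For a sense of proportion, a back-of-envelope using only the disclosed figures quoted above:

```python
# Relative scale of the disclosed training-data figures (all inputs quoted above).
VQA_PAIRS = 11.4e6         # synthetic VQA pairs
VQA_TOKENS = 45e9          # tokens across those pairs
PRETRAIN_TOKENS = 25e12    # pretraining corpus
MULTIMODAL_TOKENS = 127e9  # mixed-modality tokens

print(f"~{VQA_TOKENS / VQA_PAIRS:,.0f} tokens per synthetic VQA pair")        # ~3,947
print(f"synthetic VQA is ~{VQA_TOKENS / PRETRAIN_TOKENS:.2%} of pretraining") # ~0.18%
print(f"mixed-modality data is ~{MULTIMODAL_TOKENS / PRETRAIN_TOKENS:.2%}")   # ~0.51%
# The multimodal grounding is a thin, curated slice on top of a very large
# text corpus, consistent with the text backbone doing most of the reasoning.
```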

What Builders Should Actually Use It For — And Where Reddit Is Right

The community reception has split along a useful axis. Builders running it locally, including reports of the full 267K context window working on a 5090+3090 rig, describe it as 'probably the fastest model I have ever seen' and 'good and fast for video,' which lines up with the throughput numbers. The friction is twofold: the 4-bit quant lands at ~25GB and the 8-bit at ~36GB, which awkwardly overshoots the 24GB consumer ceiling on a single card, and the coding ability is genuinely poor. Reports of black-screen HTML hallucinations and of Qwen 35B beating it on a C89 task are not edge cases; 'this is not a coding model,' the correction NVIDIA partisans offer online, is the right framing, but the community is right to flag the weakness.
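
Those quant sizes are easy to sanity-check against the parameter count. A naive weights-only floor, ignoring KV cache, activations, and any tensors kept at higher precision, all of which the reported ~25GB and ~36GB figures evidently include:

```python
# Naive weights-only VRAM floor for a 30B-parameter checkpoint. Real files run
# larger (mixed-precision layers, metadata), and inference adds KV cache and
# activations, which is how reported sizes of ~25GB (4-bit) and ~36GB (8-bit)
# exceed these minimums.
PARAMS = 30e9

for bits in (4, 8, 16):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits}-bit weights: ~{gb:.0f} GB minimum")
# 4-bit: ~15 GB, 8-bit: ~30 GB, BF16: ~60 GB. Even the 4-bit floor plus
# context overhead makes a single 24GB consumer card a squeeze.
```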

The practical guidance that emerges is sharper than the launch marketing implies. Reach for Omni when the workload is QA over messy multimodal documents, OCR at scale, video tagging and summarization, ASR on long audio, or agentic computer use where latency dominates. Do not reach for it as a drop-in replacement for a coding model or a general-purpose chat assistant — the post-training mix is heavy on multimodal grounding and light on code. The English-only ASR limitation is also a real constraint for global teams (especially given that NVIDIA's own Parakeet supports 25 other languages elsewhere). Read against this honest picture, the model is exactly what its design implies: a perception-and-reasoning workhorse for agents that need to see, hear, and read the world fast — not a generalist.

Historical Context

2023-11
First public Nemotron-branded release: Nemotron-3 8B aimed at enterprise chatbots and copilots.
2024-06
Released the Nemotron-4 340B family (Base/Instruct/Reward), establishing Nemotron as a synthetic-data and instruction-tuning workhorse.
2025-01
At CES 2025, NVIDIA announced the Llama Nemotron family in Nano, Super, and Ultra tiers — formalizing the size-tier branding Omni now extends.
2025-08
Published the Nemotron Nano 2 technical report introducing the hybrid Mamba-Transformer recipe that Omni inherits and extends to multimodal inputs.
2025-12
Announced the Nemotron 3 family and shipped Nemotron 3 Nano (text) first, setting up Omni as the multimodal follow-on.
2026-04-28
Released Nemotron 3 Nano Omni — the first omni-modal entry in the Nemotron 3 family — alongside SageMaker JumpStart, OpenRouter, and Hugging Face availability.

Power Map

Key Players
Subject: NVIDIA Nemotron 3 Nano Omni multimodal model release

NVIDIA

Developer and publisher; uses Omni to anchor its agentic AI platform strategy and the NIM microservice ecosystem, leveraging open weights to seed downstream lock-in into its inference stack.

H Company

Lead agentic-AI design partner using Omni to interpret full-HD screen recordings in real time, validating the model's claim to be the first practical substrate for browser/computer-use agents.

Palantir, Foxconn, Infosys

Enterprise launch adopters spanning defense/intelligence, manufacturing, and IT services; their presence signals that NVIDIA is positioning Omni for regulated and operations-heavy workloads, not consumer chat.

AWS / Amazon SageMaker JumpStart

Primary cloud distribution channel; offers the model with 131K-token context, chain-of-thought reasoning, tool calling, and JSON output, putting Omni in front of every SageMaker enterprise tenant.

Hugging Face and OpenRouter

Open-weights and free hosted-API distribution; Hugging Face hosts BF16/FP8/NVFP4 checkpoints while OpenRouter provides a free endpoint, ensuring developer reach beyond paying enterprise channels.

Coactive (MediaPerf)

Independent benchmark organization whose third-party throughput numbers became the canonical evidence cited by partners — converting a vendor claim into a defensible market position against GPT 5.1 and Gemini 3.0 Pro.

Source Articles

Top 5

The Signal

Analysts

"Cloix frames Omni's speed not as a benchmark win but as a categorical change: 'By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings — something that wasn’t practical before. This isn’t just a speed boost: It’s a fundamental shift in how our agents perceive and interact with digital environments in real time.'"

Gautier Cloix
CEO, H Company

"On why latency, not capability, is the binding constraint for agent usefulness: 'To build useful agents, you can’t wait seconds for a model to interpret a screen.'"

Gautier Cloix
CEO, H Company

"Highlights NVIDIA's unusual transparency about its synthetic-data pipeline — which drew on Qwen3-VL-30B, gpt-oss-120b, Kimi-K2.5, DeepSeek-OCR, GPT-4o, and Gemini 3 Flash Preview — and notes the OSWorld GUI-agent leap: 'On the OSWorld benchmark for GUI agents, accuracy jumps from 11.1 to 47.4 points compared to the previous version.'"

The Decoder
Independent tech publication

"Reports Omni dominated every benchmarked task across open and closed-source models: 'Tagging: 9.91 h/h (5x GPT 5.1, 6x Gemini 3.0 Pro, 2.6x Qwen3-VL), Summarization: 10.79 h/h (4x GPT 5.1, 4x Gemini 3.0 Pro, 2.7x Qwen3-VL).'"

Coactive (MediaPerf)
Independent benchmark organization

"Frames the architectural choice as the central efficiency lever: the design 'combines Mamba layers for sequence and memory efficiency with transformer layers for precise reasoning,' enabling long-context multimodal inputs at low active-parameter cost."

NVIDIA Technical Blog
First-party engineering communication
The Crowd

"Meet Nemotron 3 Nano Omni 👋 Our latest addition to the Nemotron family is the highest efficiency, open multimodal model with leading accuracy. 30B parameters. 256K context length. 🧵👇"

@NVIDIAAI

"Excited to support @NVIDIA Nemotron 3 Nano Omni, now available on Fireworks. It is the first open model that handles vision, audio, video, and text in a single inference loop. Built for multimodal sub-agents at scale, with 9× higher throughput than Qwen3 30B. 256K context."

@FireworksAI_HQ

"NVIDIA Nemotron 3 Nano Omni is now available on Amazon SageMaker JumpStart. This multimodal model supports video, audio, image, and text, enabling enterprise Q&A, summarization, transcription, OCR, and document intelligence."

@AWSAI

"NVIDIA releases Nemotron-3-Nano-Omni"

u/yoracale
Broadcast
Introducing NVIDIA Nemotron 3 Nano Omni

Nvidia Nemotron 3 Nano Omni - First Test and Impression

NVIDIA Nemotron 3 Nano Omni — See, Hear & Read Everything Locally