One model, three stacks: the world-model bet behind Omni
The most distinctive thing about Gemini Omni is not what it generates but what it claims to model. According to DeepMind, Omni is built by fusing three previously separate lineages — the Gemini reasoning stack, the Veo video backbone, and the Genie world-simulation layer — into a single unified architecture [1]. The pitch is that the model does not draw frames the way diffusion video models traditionally do; it reasons about a scene, then synthesizes the next state in a way that is supposed to respect physical laws and continuity across edits. Decrypt summarized the DeepMind framing bluntly: Omni is a world model AI that can understand and simulate the world [2]. EfficientlyConnected's Paul Nashawaty went further, arguing that a native world model changes the operational surface area of enterprise AI applications, not just the creative output [3].
This matters because it lines up with DeepMind's longer-running AGI thesis. Hassabis used the keynote to position Omni as a step toward general intelligence, and the supporting blog post leans on the same vocabulary [4]. The technical bet is that grounding a generative model in something resembling a physics simulator is what eventually lets agents act in the real world rather than just talk about it. Whether that bet pays off cannot be tested today, but the bet itself explains why Google chose to launch Omni as a sibling architecture to Gemini 3.5 rather than as a standalone video product: in the company's telling, this is the same intelligence stack that will eventually drive Spark, Antigravity, and agentic Search.



