The Pipeline Problem They're Trying to Kill
The central technical argument behind interaction models is not that they're faster — it's that everyone else's voice stack is structurally blind. In today's real-time products like GPT-Realtime and Gemini Live, audio streams in continuously, but the underlying language model never sees the raw signal; it only receives a finalized transcript after a separate ASR layer decides the user has stopped talking [1]. That hand-off is where backchannels, mid-sentence corrections, and visual interjections go to die. The model literally cannot listen while it talks because, at the layer where reasoning happens, it isn't connected to the microphone.
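A rough sketch of that serialized hand-off makes the blind spot concrete. Everything below is hypothetical: detect_end_of_speech, asr_finalize, llm_respond, and tts_speak are invented stand-ins for whatever ASR, LLM, and TTS services a real stack wires together, but the control flow is the point — nothing reaches the model until an endpoint decision fires, and while the reply is playing back, new frames just queue up unheard.

```python
import queue
import time

# All names below are hypothetical stand-ins for a real ASR / LLM / TTS stack.
def detect_end_of_speech(frames: list) -> bool:
    return len(frames) >= 50                   # pretend ~1s of 20ms frames means "done"

def asr_finalize(frames: list) -> str:
    return "<finalized transcript>"            # the raw signal is discarded here

def llm_respond(transcript: str) -> str:
    return "<model reply>"                     # the LLM only ever sees this text

def tts_speak(text: str) -> None:
    time.sleep(1.0)                            # playback blocks the whole turn

def pipeline_turn(mic: "queue.Queue[bytes]") -> None:
    frames: list = []
    while not detect_end_of_speech(frames):    # the model sees nothing yet
        frames.append(mic.get())
    reply = llm_respond(asr_finalize(frames))  # reasoning starts only after the
                                               # transcript is frozen
    tts_speak(reply)                           # meanwhile new mic frames pile up,
                                               # unheard at the reasoning layer
```

Each stage runs to completion before the next begins, which is exactly the serialization the interaction-model pitch is aimed at.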
Thinking Machines' fix is to push the multimodal stream all the way into the transformer. Audio enters as dMel features, video as 40×40 patches, and both pass through lightweight embedding layers co-trained with the model itself in what the lab calls 'encoder-free early fusion' [2]. The system then runs on 'micro-turns' that interleave 200ms of input ingestion with 200ms of generation, so the same model is simultaneously hearing, watching, and speaking [2]. A separate asynchronous background model handles longer chain-of-thought reasoning and tool calls in parallel, so latency stays low while the system can still 'think.' The bet is that this is the only design where interactivity will continue to work as base models get bigger and slower.
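Under that design, the control flow looks less like request/response and more like a fixed-cadence scheduler. The asyncio sketch below is an assumption-laden illustration of that cadence, not Thinking Machines' actual API: ingest_window, generate_window, and background_reason are invented names, and the 200ms budget is the only number taken from the description above.

```python
import asyncio
from typing import Optional

MICRO_TURN_MS = 200  # the only figure taken from the article: ~200ms per half-cycle

async def ingest_window(frames: asyncio.Queue, budget_ms: int) -> list:
    """Collect whatever audio/video tokens arrive during one micro-turn."""
    loop = asyncio.get_running_loop()
    tokens, deadline = [], loop.time() + budget_ms / 1000
    while loop.time() < deadline:
        try:
            tokens.append(await asyncio.wait_for(frames.get(), timeout=0.01))
        except asyncio.TimeoutError:
            pass
    return tokens

async def generate_window(context: list, budget_ms: int) -> bytes:
    """Decode roughly one micro-turn's worth of speech tokens from the shared context."""
    await asyncio.sleep(budget_ms / 1000)       # stand-in for actual decoding
    return b"<speech tokens>"

async def background_reason(context: list) -> None:
    """Slower chain-of-thought and tool calls, off the latency-critical path."""
    await asyncio.sleep(2.0)                    # stand-in for the background model

async def interaction_loop(frames: asyncio.Queue, speaker: asyncio.Queue) -> None:
    context: list = []                          # one running multimodal context
    background: Optional[asyncio.Task] = None
    while True:
        context += await ingest_window(frames, MICRO_TURN_MS)             # hear / watch
        await speaker.put(await generate_window(context, MICRO_TURN_MS))  # speak
        if background is None or background.done():                       # think, asynchronously
            background = asyncio.create_task(background_reason(context))
```

The structural claim this illustrates is that ingestion never waits on generation's content: an interruption, a backchannel, or something held up to the camera lands in the context within one micro-turn, even if the model is mid-sentence.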




