Reasoning grafted onto a streaming voice path — and the latency tax it introduces
The headline claim is that GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning, exposed via five adjustable levels from minimal to xhigh. In practice, that means the model can deliberate mid-conversation: weigh tool calls in parallel, plan a multi-step action, and emit a verbal preamble like 'let me check that' to keep the human engaged while it thinks. Combined with a 128K-token context window — which OpenAI characterizes as roughly one to two hours of dense raw audio — the architecture is unambiguously aimed at long-running voice agents that must hold state across a real call rather than answer one-shot prompts.
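The window-to-audio figures above imply a rough token rate for the audio stream. This sketch just works out that arithmetic from the article's own numbers (128K tokens, one to two hours of audio); the exact tokenization rate of the model is not public, so these are bounds, not specs:

```python
# Rough token-budget arithmetic from the figures in the article:
# a 128K-token window said to hold roughly 1-2 hours of dense audio.
CONTEXT_TOKENS = 128_000

def implied_tokens_per_second(hours_of_audio: float) -> float:
    """Audio token rate implied if the full window holds this many hours."""
    return CONTEXT_TOKENS / (hours_of_audio * 3600)

# Two hours per window implies ~17.8 tokens/s of audio;
# one hour per window implies ~35.6 tokens/s.
optimistic = implied_tokens_per_second(2.0)
conservative = implied_tokens_per_second(1.0)
```

Either way, a long call consumes the window steadily, which is why holding state across a real conversation, rather than re-prompting from scratch, is the design target.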
The tension is that reasoning costs time, and voice users famously punish latency. Developer-community responses surfaced this directly: one early tester argued the new 'thinking' gains do not translate cleanly to a voice surface where users expect instant replies, and others felt the model came across as smarter but more robotic than the GPT-4o voice demos from May 2024. OpenAI's design answer is the reasoning-level dial plus the preamble pattern — let the model talk while it thinks — but builders will need to pick a level per turn rather than per session, and treat 'minimal' as the default for chitchat and 'xhigh' as a tool the agent reaches for only when a task actually requires deliberation.
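The per-turn selection policy described above can be sketched as a small heuristic. Only 'minimal' and 'xhigh' are named in the article; the intermediate level name and the `needs_tools` / `multi_step` signals here are illustrative assumptions, not a documented API:

```python
# Hypothetical per-turn reasoning-level picker, following the pattern in
# the text: 'minimal' is the default for chitchat, 'xhigh' is reserved for
# turns that genuinely require deliberation. The 'medium' tier and the
# two boolean signals are assumptions for illustration.
def pick_reasoning_level(needs_tools: bool, multi_step: bool) -> str:
    """Choose a reasoning level for a single conversational turn."""
    if needs_tools and multi_step:
        return "xhigh"    # a real task: plan steps and weigh tool calls
    if needs_tools or multi_step:
        return "medium"   # some deliberation, but keep latency bounded
    return "minimal"      # plain conversation: reply fast

pick_reasoning_level(False, False)  # → "minimal"
pick_reasoning_level(True, True)    # → "xhigh"
```

The point of making this a per-turn decision, paired with a spoken preamble on the expensive turns, is that the latency tax is only paid when the turn actually buys something with it.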



