TECH

OpenAI Realtime API: GPT-Realtime-2, Translate, and Whisper Voice Models


Strategic Overview

  • 01.
    On May 7, 2026, OpenAI launched three new audio models on its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, positioning voice as a primary interface for AI agents that can listen, reason, translate, and act mid-conversation.
  • 02.
    GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning, expanding the context window from 32K to 128K tokens and exposing a five-level reasoning_effort knob (minimal, low, medium, high, xhigh) so developers can trade latency for depth.
  • 03.
    GPT-Realtime-Translate translates from 70+ input languages into 13 output languages and was trained on thousands of hours of professional interpreter audio, so it stays translation-only and waits for sufficient context before speaking.
  • 04.
    GPT-Realtime-Whisper is a streaming speech-to-text model with developer-tunable latency (lower delay yields earlier partial text; higher delay improves transcript quality) and ships alongside two new API-exclusive voices, Cedar and Marin. A configuration sketch follows this list.
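
A minimal sketch of what that latency dial might look like in practice. The event shape mirrors the Realtime API's existing transcription sessions, but the model name and the latency field below are assumptions drawn from the description above, not a published schema:

```python
import json

# Hypothetical configuration for GPT-Realtime-Whisper's latency dial. The event
# type follows the Realtime API's existing transcription-session conventions;
# the "latency" field name and its values are assumptions, not a verified schema.
transcription_session = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_transcription": {
            "model": "gpt-realtime-whisper",
            # Lower delay -> earlier partial text; higher delay -> better transcripts.
            "latency": "low",  # assumed values: "low" | "balanced" | "high"
        },
    },
}

print(json.dumps(transcription_session, indent=2))
```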

The Reasoning-While-Speaking Knob

The headline change in GPT-Realtime-2 is not a louder voice or a smoother prosody — it is a five-position dial labeled reasoning_effort, with stops at minimal, low, medium, high, and xhigh. Crank it up and the model burns more thinking tokens before opening its mouth; leave it at the default low and it answers fast and shallow, the way prior real-time models always did. The trade-off is concrete: time-to-first-audio runs about 1.12 seconds at minimal effort and stretches to 2.33 seconds at high. That is a developer-facing latency budget exposed as a parameter, not a black-box choice OpenAI made for you.
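
A minimal sketch of setting that dial, assuming reasoning_effort rides inside the standard session.update event; the five values come from the launch coverage above, but the exact schema placement is an assumption:

```python
import json

# Minimal sketch: trading time-to-first-audio for reasoning depth per session.
# The reasoning_effort levels are from the launch coverage; placing the field
# inside session.update is an assumption about the schema.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",
        "voice": "cedar",            # one of the two new API-exclusive voices
        "reasoning_effort": "high",  # minimal | low (default) | medium | high | xhigh
    },
}

# Reported latency budget: ~1.12 s time-to-first-audio at "minimal",
# stretching to ~2.33 s at "high".
print(json.dumps(session_update, indent=2))
```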

What the knob actually buys is captured in the benchmarks. On Big Bench Audio, GPT-Realtime-2 at high effort scores 96.6% versus 81.4% for the prior generation — a 15.2-point absolute jump. On Audio MultiChallenge instruction-following, xhigh hits 48.5% against 34.7% before, a 13.8-point gain. Beyond raw accuracy, the model can call multiple tools in parallel and narrate what it is doing while it does it, so a flight-search agent does not go silent for three seconds while a tool round-trips. When something does fail, the model can verbally signal trouble — 'I'm having trouble with that right now' — instead of the dead air that has plagued every prior voice agent. Together those behaviors — graded reasoning, parallel tool calls, in-flight narration, graceful failure — are the mechanical reason early adopters describe this as a different category of voice agent rather than an incremental upgrade.
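
To make the parallel-tools-plus-narration pattern concrete, here is a hedged sketch of the client side: two hypothetical flight tools registered on the session, with audio deltas played as they arrive and tool calls dispatched as overlapping tasks rather than serially. The tool names are illustrative, and the event names follow existing Realtime API conventions; treat both as assumptions for this model.

```python
import asyncio
import json

# Hypothetical flight-search tools; names and schemas are illustrative, not
# taken from the launch materials.
TOOLS = [
    {
        "type": "function",
        "name": "search_flights",
        "description": "Search flights by origin, destination, and date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["origin", "destination", "date"],
        },
    },
    {
        "type": "function",
        "name": "check_seat_availability",
        "description": "Check remaining seats on a given flight number.",
        "parameters": {
            "type": "object",
            "properties": {"flight_number": {"type": "string"}},
            "required": ["flight_number"],
        },
    },
]

async def handle_event(event: dict, play_audio, run_tool) -> None:
    """Dispatch one server event: audio keeps playing while tools run."""
    if event["type"] == "response.audio.delta":
        # Narration streams even while tool calls are in flight.
        play_audio(event["delta"])
    elif event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        # Schedule rather than await, so two calls in one turn overlap.
        asyncio.create_task(run_tool(event["name"], args, event["call_id"]))
```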

The $32-Per-Million-Token Voice

Reasoning-grade voice is not cheap. GPT-Realtime-2 lists at $32 per million audio input tokens and $64 per million audio output tokens, with cached input dropping dramatically to $0.40 per million. The translation model bills at $0.034 per minute and Whisper at $0.017 per minute, both flat-rate by audio duration. For comparison, those audio rates sit roughly an order of magnitude above text inference on flagship models, which means a long, reasoning-heavy customer-support call can rack up a meaningful per-conversation cost — something contact-center operators paying pennies-per-minute on legacy IVR will feel immediately.

The cached input price is the load-bearing number for production economics. At $0.40 per million, repeated system prompts, tool schemas, and long-running conversation histories become nearly free on retrieval, which is why OpenAI keeps emphasizing the 128K context window: the architectural bet is that voice agents will hold long conversations, lean heavily on cache hits, and amortize the headline rate. That makes session shape — short bursty calls versus long sustained dialogues — the variable that decides whether GPT-Realtime-2 is operationally affordable for a given workload. Builders running scripted, high-volume, low-context flows will likely pair Whisper for transcription with cheaper text-mode reasoning; builders running open-ended, multi-step agents will pay the premium and bank on cache.
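
A back-of-the-envelope way to see how session shape moves the bill. The rates are the list prices above; the tokens-per-minute figure is an illustrative assumption, not an OpenAI number.

```python
# Session economics at the listed rates. The ~600 audio tokens/minute figure
# below is an illustrative assumption for sizing, not an official conversion.
AUDIO_IN_PER_M = 32.00    # $ per 1M audio input tokens
AUDIO_OUT_PER_M = 64.00   # $ per 1M audio output tokens
CACHED_IN_PER_M = 0.40    # $ per 1M cached input tokens

def call_cost(in_tokens: int, out_tokens: int, cached_fraction: float) -> float:
    """Estimate one call's cost given how much input is served from cache."""
    fresh = in_tokens * (1 - cached_fraction)
    cached = in_tokens * cached_fraction
    return (fresh * AUDIO_IN_PER_M + cached * CACHED_IN_PER_M
            + out_tokens * AUDIO_OUT_PER_M) / 1_000_000

# A 10-minute call, assuming ~600 audio tokens per minute in each direction.
tokens = 10 * 600
print(f"cold cache: ${call_cost(tokens, tokens, 0.0):.2f}")  # ~$0.58
print(f"90% cached: ${call_cost(tokens, tokens, 0.9):.2f}")  # ~$0.41
```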

Where Local Stacks Still Win

The developer reaction split sharply. Inside the more accelerationist corners of Reddit the energy was straightforwardly positive — GPT-5-class reasoning landing in a voice model, live translation for travel, the three-model stack on a single API. A more skeptical practitioner thread carried real weight, though: a chunk of the community argues the Realtime API for raw TTS and STT is 'mostly pointless' because local pipelines built around Kokoro TTS preloaded in VRAM, faster-whisper, and NVIDIA's realtime STT already clear roughly 80 ms end-to-end — well under anything a cloud round trip can achieve. One developer claimed the OpenAI demo's text-finishes-before-audio-plays cadence was slower than their local pipeline running at 2x speed.

The critique is narrower than it sounds. Local stacks win on raw latency for thin TTS/STT, but they do not have GPT-5-class reasoning, parallel tool use, interruption handling, or 70-language live translation. The real question for builders is which of those capabilities they actually need in the loop. A meeting-transcription product probably should not pay for cloud STT (GPT-Realtime-Whisper bills $0.017 per minute) when faster-whisper runs locally for free. A travel-booking voice agent that needs to reason about flight constraints and call three tools mid-utterance has no realistic local equivalent today. Running through that thread is a related complaint — that ChatGPT consumer Voice Mode still feels noticeably worse than the May 2024 demo — and Simon Willison flagged the same disconnect from a more measured angle. The product surface that actually got upgraded is the one developers touch, not the one most consumers experience.
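
For reference, the local baseline the skeptics describe is only a few lines of faster-whisper with the model held resident, no network hop involved; the model size and device settings below are illustrative.

```python
# Local transcription with faster-whisper: load once so weights stay in VRAM,
# then transcribe with no cloud round trip. Model size and device are
# illustrative choices, not a benchmark configuration.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")

segments, info = model.transcribe("meeting.wav", beam_size=5)
print(f"detected language: {info.language}")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```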

Why Zillow's 26-Point Jump Matters Beyond Real Estate

[Chart: Prior model vs. GPT-Realtime-2 across four production benchmarks (DataCamp, The Next Web, The Rundown AI, Latent Space, May 2026).]

Zillow's headline number — 95% versus 69% on its hardest adversarial benchmark, a 26-point lift — gets cited as a generic 'voice agents are better now' stat. The more interesting detail is what that benchmark is for. Zillow runs a real-estate voice assistant tied to its BuyAbility feature, and the company specifically calls out improved Fair Housing compliance as critical for its deployment. Adversarial benchmarks in real estate are not abstract eval rigor; they are the test set you use to make sure your agent does not steer protected classes toward or away from neighborhoods, does not make discriminatory recommendations, and does not produce phrasing a regulator could call out.

That reframes the launch as something more than a voice-quality upgrade. Regulated industries — housing, healthcare, finance, telecom — have stayed cautious on voice agents not because the speech sounded bad but because earlier models flunked precisely the adversarial cases compliance teams care about. A 26-point jump on the hardest of those cases is the kind of evidence a legal team needs to greenlight production deployment. Deutsche Telekom's multilingual support pilot points the same direction: cross-language voice support runs straight into consumer-protection rules, and translation that 'actually works incredibly well' (as OpenAI's Boris Power put it) is a prerequisite for shipping. The pattern across Zillow, Deutsche Telekom, Priceline, and BolnaAI is that the launch customers OpenAI surfaced are exactly the ones for whom 'good enough' had to clear a much higher bar than entertainment voice — and that is the market this release is built to unlock.

Historical Context

2024-10-01
Announced the Realtime API in public beta at its annual developer conference, enabling paid developers to build low-latency multimodal voice experiences powered by gpt-4o-realtime-preview.
2024-12-17
Added WebRTC as a transport option, released gpt-4o-realtime-preview-2024-12-17 and gpt-4o-mini-realtime-preview-2024-12-17, and cut audio token prices by 60%.
2025-08-01
Realtime API reached general availability with the launch of the purpose-built gpt-realtime model for production voice agents.
2026-05-07
Launched GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, with new voices Cedar and Marin and partner adopters Zillow, Priceline, Deutsche Telekom, Vimeo, BolnaAI, Glean, and Genspark.

Power Map

Key Players
Subject

OpenAI Realtime API: GPT-Realtime-2, Translate, and Whisper Voice Models

OpenAI

Developer of the Realtime API and the three new audio models, positioning voice as a primary AI interface and competing for production voice agent workloads against Google, ElevenLabs, and local open-source stacks.

Zillow

Early adopter building a voice real-estate assistant tied to its BuyAbility feature; reports a 26-point lift in call-success rate (95% vs 69%) on its hardest adversarial benchmark and improved Fair Housing compliance.

Deutsche Telekom

Testing GPT-Realtime-2 for multilingual customer-support voice interactions, betting on lower latency and translation fluency to streamline cross-language conversations across its European footprint.

Priceline

Early adopter building voice-driven travel booking that lets users search flights and hotels conversationally, leaning on parallel tool calls and reasoning-while-speaking to handle multi-step itineraries.

BolnaAI

Voice agent platform reporting 12.5% lower word error rates on Hindi, Tamil, and Telugu using GPT-Realtime-Translate — a meaningful leverage point for operators serving low-resource Indian-language markets.

Glean

Enterprise search vendor reporting a 42.9% relative increase in helpfulness over the prior model in internal voice evals, validating the model for knowledge-work voice agents.

Source Articles

Analysts

"Notes that users increasingly turn to voice with AI when they need to 'dump' lots of context, and signals that ChatGPT voice improvements are coming behind the API release."

Sam Altman
CEO, OpenAI

"Frames real-time voice-to-voice translation as a long-anticipated OpenAI ambition that has been a goal since the company's early days."

Greg Brockman
Cofounder and President, OpenAI

"Cautions that the API release does not mean ChatGPT Voice Mode itself has been upgraded yet, though the consumer upgrade 'sounds' like it is coming soon."

Simon Willison
Independent developer and analyst

"Argues that real-time tool use, reasoning while speaking, and live translation are the trio that could finally make audio interfaces 'take off.'"

Will Depue
Researcher, OpenAI

"Reports that GPT-Realtime-Translate delivered 12.5% lower Word Error Rates than any other model BolnaAI tested, particularly on Indian languages."

Prateek Sachan
Co-founder and CTO, BolnaAI
The Crowd

"Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API"

@OpenAI

"OpenAIがRealtime APIに3つの音声モデルを追加。OpenAIがRealtime APIでGPT-Realtime-2、GPT-Realtime-Translate、GPT-Realtime-Whisperを発表。音声対話、リアルタイム翻訳、ストリーミング音声文字起こしをRealtime API上で扱える構成。"

@LangChainJP

"New Voice Model from OpenAI in the API gpt-realtime-2 Here a quick demo I built"

@diegocabezas01

"New OpenAI Voice models: GPT-Realtime-2, Translate, and Whisper"

u/Rollertoaster7206
Broadcast
Introducing gpt-realtime in the API

We're introducing three audio models in the API

GPT-Realtime-2: OpenAI's MOST Intelligent Voice Model Yet!