Why streaming beats turn-by-turn: the few-seconds-behind trick
The core engineering bet is latency. Older translation systems work turn by turn: they wait for the speaker to finish a sentence and only then translate it, which produces the awkward stop-start cadence familiar from conference interpreting apps. Gemini 3.5 Live Translate instead generates translated speech continuously, explicitly balancing the trade-off between waiting for more context (which improves quality) and translating immediately (which keeps the output in sync with the speaker) [1]. The result, per Google, is fluid audio that stays only a few seconds behind the speaker throughout a session [1]. Just as important as speed is prosody preservation: the model auto-detects 70+ languages and reproduces the original speaker's intonation, pacing, and pitch rather than flattening everything into a robotic monotone [1]. The model card describes the underlying infrastructure: the model is built on Gemini 3 Pro, with an audio input context window of up to 128K tokens and output of up to 64K tokens [2].
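
To make the wait-versus-translate trade-off concrete, here is a minimal Python sketch that compares a turn-by-turn schedule with a "wait-k" schedule, a simple policy from the simultaneous-translation research literature. This is not Google's method, which the sources do not describe. The function names, the one-output-token-per-source-token assumption, and the choice of k = 3 are all illustrative assumptions.

```python
from typing import List


def turn_by_turn_schedule(n_source: int) -> List[int]:
    """Turn-by-turn: every output token is emitted only after the
    whole sentence (all n_source tokens) has been heard."""
    return [n_source] * n_source


def wait_k_schedule(n_source: int, k: int) -> List[int]:
    """Wait-k (illustrative policy, not the product's algorithm):
    hear k tokens first, then emit one output token per new source token.
    Entry i is the number of source tokens heard before output token i."""
    return [min(k + i, n_source) for i in range(n_source)]


def average_lag(schedule: List[int]) -> float:
    """Mean number of source tokens the output trails the speaker by.
    Assumes one output token per source token (a simplification)."""
    return sum(heard - i for i, heard in enumerate(schedule)) / len(schedule)


if __name__ == "__main__":
    n = 10  # hypothetical 10-token sentence
    print(average_lag(turn_by_turn_schedule(n)))  # 5.5
    print(average_lag(wait_k_schedule(n, k=3)))   # 2.7
```

In this toy model, a larger k gives the translator more context before it commits to each token, at the cost of a larger lag. Setting k to the sentence length reproduces the turn-by-turn schedule, while a small k keeps the output a roughly constant distance behind the speaker.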




