Master real-time voice AI architecture: turn detection, streaming STT/LLM/TTS, native audio trade-offs, WebRTC transport, and barge-in state.
Diffusion Models & Image Generation focused on iterative generation under latency and quality trade-offs. Voice agents bring that pressure into live conversation: audio must stream, tools must resolve, and playback must stay interruptible.
A real-time voice AI agent has to listen, decide, speak, and recover quickly enough that the conversation still feels live. This design chapter covers speech pipelines, turn-taking, latency, tools, and fallback behavior.
Imagine calling customer support to track a delayed package. You say, "Where is my order?" and the voice on the other end pauses for two full seconds before answering. In those two seconds, you already wonder if the call dropped, or if the system heard you at all. That silence kills the illusion that you are talking to something intelligent.
Human conversation runs on split-second timing. Studies of turn-taking show that people minimize silence and overlap across languages, with response timing varying by only hundreds of milliseconds across communities.[1] If a voice AI agent takes much longer than that, the conversation feels broken, no matter how smart the underlying model is. Building a system that listens, thinks, and speaks fast enough to feel natural is one of the hardest engineering problems in modern AI.
This article walks through the architecture of a production-grade voice agent, using a concrete example: a package-tracking assistant for an online store. You will follow a single audio packet from the user's mouth to the model's "brain" and back to the speaker, learn why every millisecond matters, and see how the system recovers when the user interrupts mid-sentence. The concepts here apply to any real-time voice product, from customer support to warehouse coordination.
The primary difference between a text-based chatbot and a voice agent is the continuous flow of time. A text interface lets the user wait patiently while a spinner animates. In a voice interface, silence is a signal. If the agent takes too long to reply, the user assumes it didn't hear them and will likely repeat themselves, causing confusion and overlapping speech.
A useful product target for an interactive support call is a measured time-to-first-audio (TTFA) budget around 500ms for the network and device slices you intend to support. Treat that as a service objective to validate, not a universal boundary: endpoint policy, tools, model choice, and network conditions move the distribution. Research systems such as AudioPaLM and SpeechGPT[2][3] show where speech-native models can go, while cascaded or hybrid pipelines remain useful when transcript inspection, debugging, and tool control matter.
Before analyzing the architecture, it helps to picture the problem. Building a voice agent is like running a live routing line where each component must hand off partial work without waiting for the full case to finish:
If any stage drops packets or runs too slowly, the conversation breaks. The user experiences that as silence, overlap, or a response to a sentence they already corrected.
To make this concrete, imagine a customer named Alex who calls a voice support line for an online store. Alex says, "Where is my order number four-two-one-nine?"
Here is what happens to that audio, step by step, before Alex hears a response:
Each of these steps gets its own detailed look below. First, let's set the latency budget so you know what "fast enough" means in practice.
Designing a voice AI agent requires balancing functional capabilities with strict latency constraints. The system needs to understand and generate natural language fast enough to mimic human turn-taking.
Use 500ms as an example target, not a law of physics. A deployed route must report TTFA distributions rather than promise one illustrative number.
Trace the ranges: each stage adds time, but streaming overlap allows read-only preparation before the final transcript and lets TTS start once an accepted turn has yielded a stable speakable clause.
Treat the 500ms number as a sequential accounting exercise. Good implementations hide some of that with streaming STT, streaming LLM output, and early TTS.
Let's walk through Alex's "Where is my order?" query with illustrative numbers:
| Stage | Time | What's happening |
|---|---|---|
| Turn detection | 40-80ms | VAD decides Alex has stopped speaking |
| First STT partial | 50-150ms | Streaming decoder emits "Where is my order" while Alex is still talking |
| First LLM clause | 100-220ms | Model generates "Your order four-two-one-nine is" based on partial transcript |
| First TTS chunk | 80-120ms | Synthesizer turns that clause into the first 200ms of audio |
| Playout buffer | 20-40ms | Jitter buffer smooths network before the speaker |
If these ran one after another, the total would land between 290ms and 610ms, or about 450ms at the midpoints, close to the 500ms budget. In practice, the stages overlap.
In one candidate trace, this overlap may deliver first audio in 300-450ms. The production decision depends on measured p50/p95 TTFA, incorrect-commit rate, endpointing aggressiveness, model size, and network jitter.
```python
import json

# One measured trace, in milliseconds after committed turn end.
spans = {
    "endpoint_commit": (0, 60),
    "stt_finalization": (0, 100),
    "llm_first_clause": (60, 240),
    "tts_first_chunk": (210, 345),
    "playout_buffer": (345, 380),
}
target_ms = 500
first_audio_ms = spans["playout_buffer"][1]

print(json.dumps({
    "sequential_sum_ms": sum(end - start for start, end in spans.values()),
    "first_audio_ms": first_audio_ms,
    "target_ms": target_ms,
    "meets_trace_target": first_audio_ms <= target_ms,
    "ship_decision_requires_p95": True,
}, indent=2))
```

```json
{
  "sequential_sum_ms": 510,
  "first_audio_ms": 380,
  "target_ms": 500,
  "meets_trace_target": true,
  "ship_decision_requires_p95": true
}
```

The architecture usually separates three concerns: client-side capture and playback, a latency-sensitive media edge, and the inference pipeline. Decoupling transport from inference matters because audio delivery has different failure modes from model execution. Packet loss, jitter, and echo cancellation must be handled on the media path even if the model stack is healthy.
In many deployments, a media gateway terminates WebRTC (Web Real-Time Communication) or SIP (Session Initiation Protocol), handles jitter buffers and routing, then forwards audio to inference over an internal low-latency link.[4] Not every system needs a full SFU (Selective Forwarding Unit) for a 1:1 assistant, but most serious deployments still benefit from a dedicated ingress tier close to the user.
Within inference, the voice pipeline orchestrates turn detection, STT or native-audio encoding, LLM reasoning, tool calls, and TTS or audio decoding. A conversation state manager sits alongside the pipeline, keeping track of dialogue turns, playback offsets, and tool results so the model only commits to what the user actually heard.
Follow the flow from left to right: capture, media edge, inference, and back to the speaker. Notice the barge-in path that instantly stops playback when the user interrupts.
Voice Activity Detection (VAD) determines when the user starts and stops speaking. In Alex's package-tracking call, VAD is what tells the system, "Alex is done asking the question. You can start answering now."
VAD is not the same thing as turn detection. VAD classifies each audio frame as speech or silence. Turn detection (endpointing) decides when the user's turn is actually finished so the agent can answer. Modern stacks increasingly layer three signals: VAD for raw speech presence, transcript-level endpointing for silence after stable words, and a model that judges semantic completeness from the partial transcript. Model-based end-of-turn detectors can commit a turn before trailing silence accumulates, cutting both clipped speakers and laggy responses.[5] You still keep a fast VAD underneath for instant barge-in.
Client-side VAD is common because it gives instant barge-in detection and avoids sending obvious silence. Server-side turn detection is also common because it centralizes tuning and can use richer context. The implementation below shows a generic local wrapper around a streaming VAD backend such as Silero VAD[6].
```python
from collections import deque
from dataclasses import asdict, dataclass
import json

@dataclass
class VADEvent:
    type: str
    frame: int
    audio_ms: int

class VoiceActivityDetector:
    """Detect speech boundaries from one probability per 20ms audio frame."""

    def __init__(
        self,
        frame_ms: int = 20,
        speech_threshold: float = 0.5,
        silence_duration_ms: int = 60,  # short on purpose so the demo ends quickly
        prefix_padding_ms: int = 40,
    ):
        self.frame_ms = frame_ms
        self.speech_threshold = speech_threshold
        self.silence_frames_required = max(1, silence_duration_ms // frame_ms)
        # Pre-roll keeps the most recent frames so audio just before onset is not clipped.
        self.pre_roll = deque(maxlen=max(1, prefix_padding_ms // frame_ms))
        self.speech_frames: list[int] = []
        self.in_speech = False
        self.trailing_silence = 0

    def process_frame(self, frame_id: int, probability: float) -> VADEvent | None:
        """Return speech boundary events for one audio frame."""
        self.pre_roll.append(frame_id)
        if probability >= self.speech_threshold:
            if not self.in_speech:
                self.in_speech = True
                self.speech_frames = list(self.pre_roll)
                self.trailing_silence = 0
                return VADEvent("speech_start", frame_id, len(self.speech_frames) * self.frame_ms)
            self.speech_frames.append(frame_id)
            self.trailing_silence = 0
            return None

        if not self.in_speech:
            return None

        self.speech_frames.append(frame_id)
        self.trailing_silence += 1
        if self.trailing_silence < self.silence_frames_required:
            return None

        # Exclude the trailing silence frames from the reported utterance length.
        utterance_frames = self.speech_frames[:-self.trailing_silence]
        self.in_speech = False
        self.speech_frames = []
        self.trailing_silence = 0
        self.pre_roll.clear()
        return VADEvent("speech_end", frame_id, len(utterance_frames) * self.frame_ms)

probabilities = [0.04, 0.08, 0.66, 0.74, 0.69, 0.22, 0.15, 0.09]
detector = VoiceActivityDetector()
events = []

for frame_id, probability in enumerate(probabilities):
    event = detector.process_frame(frame_id, probability)
    if event:
        events.append(asdict(event))

print(json.dumps(events, indent=2))
```

```json
[
  {
    "type": "speech_start",
    "frame": 2,
    "audio_ms": 40
  },
  {
    "type": "speech_end",
    "frame": 7,
    "audio_ms": 80
  }
]
```

`silence_duration_ms = 300` is a good starting point, not a universal truth. Phone support bots often go longer to avoid clipping slow speakers. Push-to-talk or assistant experiences can go shorter. A fixed silence threshold is the floor, not the ceiling: a semantic end-of-turn model can hold through mid-sentence pauses (a caller reciting a phone number digit by digit) and fire early once the utterance is clearly complete, so you do not have to pick one silence value that hurts every other case.[5]
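One way to combine a silence threshold with a semantic end-of-turn score is sketched below. The `completeness` score, the 120ms and 900ms bounds, and the function names are illustrative assumptions, not the interface of any particular detector.

```python
def required_silence_ms(completeness: float,
                        min_silence_ms: int = 120,
                        max_silence_ms: int = 900) -> int:
    """Interpolate the silence needed before commit from a 0..1 completeness score.

    A high score (the utterance looks finished) shortens the wait; a low score
    (for example, a caller pausing between digits) lengthens it.
    """
    clamped = min(1.0, max(0.0, completeness))
    return round(max_silence_ms - clamped * (max_silence_ms - min_silence_ms))


def should_commit_turn(trailing_silence_ms: int, completeness: float) -> bool:
    """Commit the turn once trailing silence reaches the score-dependent requirement."""
    return trailing_silence_ms >= required_silence_ms(completeness)


# Hypothetical scores, as a semantic end-of-turn model might emit them.
cases = [
    ("Where is my order number four two", 0.10, 300),
    ("Where is my order number four two one nine?", 0.95, 200),
]
for text, score, silence_ms in cases:
    print(f"{text!r}: commit={should_commit_turn(silence_ms, score)}")
```

In this sketch the incomplete order number is held even after 300ms of silence, while the complete question commits after 200ms. The bounds would need tuning against measured false cutoffs.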
Choosing where to run VAD changes responsiveness, bandwidth, and operational control. Client-side detection wins on local responsiveness. Server-side detection wins on centralized tuning and consistent turn boundaries across devices.
| Feature | Client-Side VAD | Server-Side Detection |
|---|---|---|
| Barge-in reaction | Can mute local playout without a network round trip | Includes transport delay before server action reaches playback |
| Bandwidth | Can suppress silence if upload policy uses its output | Receives the configured upstream stream continuously |
| Operational control | Requires testing across device/browser classes | Centralizes threshold and model tuning |
| Turn consistency | Device signals may differ | Can apply one server-side commit policy |
| Data exposure | Reduced only if capture/upload policy withholds silent frames | Depends on retained audio and server policy |
Many production systems use both: lightweight local VAD to mute playback instantly on barge-in, plus server-side turn detection to decide when transcripts are stable enough to trigger response generation.
The hybrid setup gives the user the fast interruption path they expect while keeping final turn boundaries consistent enough for transcripts, tools, and analytics.
A critical but often overlooked component is acoustic echo cancellation (AEC), which prevents the agent from hearing itself. When the agent speaks and the user interrupts, the microphone picks up both the user's voice and the agent's own output playing from the speaker. Without AEC, the VAD would trigger on the agent's voice, creating an echo loop where the agent transcribes its own speech and responds to it.
Modern browser capture stacks expose echo-cancellation constraints so applications can request microphone input with system or remote audio removed before it reaches VAD and STT.[7] Echo cancellers typically rely on adaptive filters that model the acoustic path from speaker to microphone. The filter continuously adjusts to account for speaker placement, room reverberation, and device variations.
One common failure is deploying voice agents without testing AEC in real acoustic environments. Speakers and microphones placed close together, as on laptops and mobile phones, create strong echo that untested AEC configurations may fail to suppress. The symptom is brutal: the agent interrupts itself.
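To make the adaptive-filter idea concrete, here is a minimal normalized least-mean-squares (NLMS) sketch. It is a toy under stated assumptions: the echo path is a made-up two-tap delay, there is no near-end speech or double-talk handling, and it is not how any specific browser implements AEC.

```python
import random

def simulated_echo(far_end):
    """Assumed toy echo path: the speaker signal returns delayed and attenuated."""
    echo = []
    for i in range(len(far_end)):
        delayed_2 = far_end[i - 2] if i >= 2 else 0.0
        delayed_3 = far_end[i - 3] if i >= 3 else 0.0
        echo.append(0.6 * delayed_2 + 0.2 * delayed_3)
    return echo

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptively estimated echo of `far_end` from `mic`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what is left after removing the estimate
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual

rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(4000)]  # stand-in for agent audio
mic = simulated_echo(far_end)                            # mic hears only the echo
residual = nlms_echo_cancel(far_end, mic)

echo_energy = sum(s * s for s in mic[-500:])
residual_energy = sum(s * s for s in residual[-500:])
print(f"echo energy {echo_energy:.3f}, residual energy {residual_energy:.6f}")
```

Once the filter has converged, the residual that would reach VAD is far smaller than the raw echo. Real rooms, device nonlinearities, and simultaneous user speech make the problem much harder than this sketch suggests.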
Two primary approaches exist for real-time STT. In Alex's call, streaming STT is what lets the system start thinking about "Where is my order" before Alex fully finishes saying "order."
Streaming STT sends partial transcripts as audio arrives, allowing the LLM to pre-read the turn and start cancellable, read-only preparation before the utterance is complete. Provider SDKs differ, but the state machine must separate unstable interim text from committed text that may enter history, trigger tools, or reach TTS.
```python
from dataclasses import asdict, dataclass
import json

@dataclass
class TranscriptEvent:
    kind: str
    text: str
    stable_prefix: str

def emit_transcript_events(partials: list[tuple[str, bool]]) -> list[TranscriptEvent]:
    events = []

    for text, is_final in partials:
        # Treat the last word of an interim result as unstable; a final result is stable in full.
        stable_prefix = text if is_final else " ".join(text.split()[:-1])
        kind = "final" if is_final else "interim"
        events.append(TranscriptEvent(kind, text, stable_prefix))

    return events

partials = [
    ("Where is", False),
    ("Where is my order", False),
    ("Where is my order number four two", False),
    ("Where is my order number four two one nine?", True),
]

print(json.dumps([asdict(event) for event in emit_transcript_events(partials)], indent=2))
```

```json
[
  {
    "kind": "interim",
    "text": "Where is",
    "stable_prefix": "Where"
  },
  {
    "kind": "interim",
    "text": "Where is my order",
    "stable_prefix": "Where is my"
  },
  {
    "kind": "interim",
    "text": "Where is my order number four two",
    "stable_prefix": "Where is my order number four"
  },
  {
    "kind": "final",
    "text": "Where is my order number four two one nine?",
    "stable_prefix": "Where is my order number four two one nine?"
  }
]
```

The commit boundary matters most when a partial transcript changes meaning. A package lookup may be safe to prefetch and discard; canceling an order or promising an ETA is not.
```python
import json

def permitted_actions(text: str, committed: bool) -> list[str]:
    actions = ["classify_intent"]
    if "order" in text.lower():
        actions.append("prefetch_read_only_status")
    if committed:
        actions.extend(["write_history", "speak_response"])
        if "cancel" in text.lower():
            actions.append("ask_for_cancel_confirmation")
    return actions

events = [
    {"text": "Cancel order four two one nine", "committed": False},
    {"text": "Track order four two one nine", "committed": True},
]

print(json.dumps([
    {"committed": event["committed"], "actions": permitted_actions(**event)}
    for event in events
], indent=2))
```

```json
[
  {
    "committed": false,
    "actions": [
      "classify_intent",
      "prefetch_read_only_status"
    ]
  },
  {
    "committed": true,
    "actions": [
      "classify_intent",
      "prefetch_read_only_status",
      "write_history",
      "speak_response"
    ]
  }
]
```

Using a batch recognizer such as Whisper[8] after speech_end is much simpler, but it adds an extra decode step after the user stops talking.
```python
import json

def batch_stt_timeline(speech_ms: int, endpoint_ms: int, decode_ms: int) -> dict[str, int | str]:
    speech_end_ms = speech_ms + endpoint_ms
    transcript_ready_ms = speech_end_ms + decode_ms
    return {
        "approach": "end-of-utterance",
        "speech_ms": speech_ms,
        "endpoint_ms": endpoint_ms,
        "decode_ms": decode_ms,
        "first_transcript_ms": transcript_ready_ms,
    }

print(json.dumps(batch_stt_timeline(speech_ms=1200, endpoint_ms=300, decode_ms=450), indent=2))
```

```json
{
  "approach": "end-of-utterance",
  "speech_ms": 1200,
  "endpoint_ms": 300,
  "decode_ms": 450,
  "first_transcript_ms": 1950
}
```

| Approach | Latency | Complexity |
|---|---|---|
| Streaming STT | First partial often arrives in ~50-150ms | High |
| End-of-utterance | Adds ~200-800ms+ after speech_end | Low |
After turn commitment, the LLM should emit a stable first speakable clause without waiting for a long response plan. Before commitment, the system may only do reversible preparation. Time-to-First-Token (TTFT), the time the LLM takes to produce its first output token, therefore matters, but a low TTFT must never let an unstable transcript become spoken truth. The serving mechanics are covered in Inference Mechanics.
Long inference-time reasoning and slow tools are poor candidates for a low-latency acknowledgement path unless measurements meet its TTFA objective. A voice agent can speak an honest acknowledgement quickly, run complex work asynchronously, and deliver the result when ready instead of pretending every question has an instant answer.
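The acknowledge-then-deliver pattern can be sketched with `asyncio`. The `lookup_order` tool, the 50ms acknowledgement budget, and the spoken phrases are hypothetical placeholders; a real system would derive the budget from its measured TTFA objective.

```python
import asyncio

async def lookup_order(order_id: str, delay_s: float) -> str:
    """Hypothetical tool call; `delay_s` stands in for backend latency."""
    await asyncio.sleep(delay_s)
    return f"Order {order_id} arrives tomorrow."

async def respond(order_id: str, tool_delay_s: float, ack_budget_s: float = 0.05) -> list[str]:
    """Speak the answer directly if the tool is fast, otherwise acknowledge first."""
    spoken = []
    task = asyncio.create_task(lookup_order(order_id, tool_delay_s))
    # Wait up to the budget without cancelling the tool call on timeout.
    done, _pending = await asyncio.wait({task}, timeout=ack_budget_s)
    if not done:
        spoken.append("One moment while I check that order.")
    spoken.append(await task)
    return spoken

print(asyncio.run(respond("4219", tool_delay_s=0.0)))
print(asyncio.run(respond("4219", tool_delay_s=0.3)))
```

The fast path speaks only the answer, while the slow path fills the gap with an honest acknowledgement and delivers the result when the tool returns.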
The helper below shows the voice-specific shape: short clauses, no Markdown, tool status stated early, and a stream of text chunks that TTS can consume before the full response is complete.
```python
import json

def voice_chunks(transcript: str, order_status: dict[str, str]) -> list[str]:
    """Build a short spoken reply and split it into clauses for streaming TTS.

    `transcript` is accepted for context but is not used in this simplified helper.
    """
    reply = (
        f"Your order {order_status['id']} is delayed. "
        f"It should arrive {order_status['eta']}. "
        "I can text you the tracking link."
    )
    clauses = [part.strip() + "." for part in reply.split(".") if part.strip()]
    return clauses

transcript = "Where is my order number four two one nine?"
chunks = voice_chunks(
    transcript=transcript,
    order_status={"id": "4219", "eta": "tomorrow"},
)

print(json.dumps({
    "contains_order": "order" in transcript.lower(),
    "max_words_per_clause": max(len(clause.split()) for clause in chunks),
    "streamed_clauses": chunks,
}, indent=2))
```

```json
{
  "contains_order": true,
  "max_words_per_clause": 7,
  "streamed_clauses": [
    "Your order 4219 is delayed.",
    "It should arrive tomorrow.",
    "I can text you the tracking link."
  ]
}
```

Prompt engineering for voice is distinct from text. Voice prompts must explicitly forbid Markdown, which TTS reads poorly, and encourage front-loading the answer so meaningful audio is generated immediately.
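Because a prompt instruction can be ignored, a defensive filter in front of TTS is one option. The sketch below is a rough assumption of what such a filter might look like; `strip_markdown_for_tts` is a hypothetical helper, and its regular expressions cover only common markers (they would also strip underscores from identifiers).

```python
import re

def strip_markdown_for_tts(text: str) -> str:
    """Remove common Markdown markers so TTS does not read them aloud."""
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)       # [label](url) -> label
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.M)  # heading markers
    text = re.sub(r"^\s*[-*+]\s+", "", text, flags=re.M)       # bullet markers
    text = re.sub(r"[*_`]+", "", text)                         # emphasis and code marks
    return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace

raw = "**Your order 4219** is `delayed`.\n- See the [tracking link](https://example.com/t)"
print(strip_markdown_for_tts(raw))
```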
Live voice sessions stress inference differently from chat. TTFT matters more than long-form throughput because TTS can't speak until the first stable clause arrives. Session length also matters because a 20-minute call can accumulate enough context to make KV-cache management part of the latency budget, not just an infrastructure detail.
The key optimization is to begin speaking before the full response is generated. The TTS engine receives a stream of text tokens, buffers them into sentences or phrases, and synthesizes audio chunks on the fly.
Waiting for the LLM to finish an entire sentence or paragraph before starting TTS can add seconds of latency. Stream tokens and synthesize clause-by-clause instead.
The function below demonstrates buffering text into short speakable clauses. Waiting for a full paragraph is too slow, but firing on every token sounds choppy.
```python
import json

def flush_speakable_clauses(tokens: list[str], max_chars: int = 52) -> list[str]:
    buffer = ""
    flushed = []

    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        if stripped.endswith((",", ".", "!", "?")) or len(stripped) >= max_chars:
            flushed.append(stripped)
            buffer = ""

    if buffer.strip():
        flushed.append(buffer.strip())

    return flushed

tokens = [
    "Your ", "order ", "4219 ", "is ", "delayed, ",
    "but ", "it ", "should ", "arrive ", "tomorrow. ",
    "I ", "can ", "text ", "the ", "tracking ", "link."
]

print(json.dumps(flush_speakable_clauses(tokens), indent=2))
```

```json
[
  "Your order 4219 is delayed,",
  "but it should arrive tomorrow.",
  "I can text the tracking link."
]
```

The 500ms budget represents the theoretical sequential sum of all components, but the production pipeline overlaps stages to fit well under that target. Here is a realistic overlap:
The illustrated trace lands around 300-450ms time to first audio (TTFA) on its assumed healthy path. Real TTFA depends on endpointing errors, model size, tool routing, and jitter-buffer depth; report its distribution for target network slices.
The key idea is overlap, not magic speed. VAD, STT, LLM, TTS, and playout each add latency, but the first audible word can arrive before every downstream job has fully completed.
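Reporting a TTFA distribution rather than a single trace can be as simple as the nearest-rank percentile sketch below. The sample values are hypothetical and chosen so that the median meets the 500ms example target while the tail does not.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical TTFA samples in milliseconds for one network slice.
ttfa_ms = [310, 335, 350, 362, 380, 395, 410, 440, 520, 790]

report = {
    "p50_ms": percentile(ttfa_ms, 50),
    "p95_ms": percentile(ttfa_ms, 95),
    "target_ms": 500,
}
report["p95_meets_target"] = report["p95_ms"] <= report["target_ms"]
print(report)
```

A route like this one would look healthy in a single-trace demo yet miss the objective for its slowest callers, which is why the ship decision rests on p95 rather than one good run.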
When the user speaks over the agent ("barge-in"), the system must react instantly to maintain the illusion of a natural conversation. Imagine Alex asks about an order, the agent starts explaining the return policy, and Alex interrupts with "Actually, just cancel it." The agent must stop immediately.
The hard part isn't stopping the speaker. The hard part is making the next model call remember only the part of the assistant response that crossed the speaker boundary.
The handler below shows the core reconciliation logic. When VAD detects fresh user speech during agent playback, the system must stop playout, cancel in-flight generation, and truncate assistant text to the portion that actually reached the speaker.
```python
from dataclasses import dataclass
import json

@dataclass
class PlayedChunk:
    text: str
    text_end_char: int
    playback_end_ms: int

class InterruptionHandler:
    def __init__(self, pending_text: str, played_chunks: list[PlayedChunk]):
        self.pending_text = pending_text
        self.played_chunks = played_chunks

    def reconcile(self, interrupt_ms: int) -> dict[str, object]:
        heard_chars = self._chars_heard(interrupt_ms)
        spoken_text = self.pending_text[:heard_chars].rstrip()
        discarded_text = self.pending_text[heard_chars:].strip()
        return {
            "history_to_keep": [{"role": "assistant", "content": spoken_text}],
            "discarded_generated_text": discarded_text,
        }

    def _chars_heard(self, interrupt_ms: int) -> int:
        heard_chars = 0
        for chunk in self.played_chunks:
            if chunk.playback_end_ms <= interrupt_ms:
                heard_chars = chunk.text_end_char
            else:
                break
        return heard_chars

pending = "Your order is delayed because the carrier missed pickup."
chunks = [
    PlayedChunk("Your order", 10, 240),
    PlayedChunk(" is delayed", 21, 520),
    PlayedChunk(" because the carrier", 41, 860),
    PlayedChunk(" missed pickup.", 56, 1120),
]

result = InterruptionHandler(pending, chunks).reconcile(interrupt_ms=650)
print(json.dumps(result, indent=2))
```

```json
{
  "history_to_keep": [
    {
      "role": "assistant",
      "content": "Your order is delayed"
    }
  ],
  "discarded_generated_text": "because the carrier missed pickup."
}
```

In production, map synthesized chunks to actual playout timestamps or RTP sequence ranges, not to model tokens alone. Jitter buffers can delay or drop chunks after synthesis, so "generated" is not the same as "heard."
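The reconciliation logic covers history truncation only. The surrounding control flow (stop playout, cancel generation, return to listening) could be sketched as a small state machine. The state, event, and action names here are illustrative assumptions rather than any framework's API.

```python
from enum import Enum

class State(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"

# (state, event) -> (next state, actions to run). Names are illustrative.
TRANSITIONS = {
    (State.LISTENING, "turn_committed"): (State.THINKING, ["start_generation"]),
    (State.THINKING, "first_audio_ready"): (State.SPEAKING, ["start_playout"]),
    (State.THINKING, "user_speech_start"): (State.LISTENING, ["cancel_generation"]),
    (State.SPEAKING, "user_speech_start"): (
        State.LISTENING,
        ["stop_playout", "cancel_generation", "truncate_history_to_heard"],
    ),
    (State.SPEAKING, "playout_finished"): (State.LISTENING, ["commit_full_reply"]),
}

def step(state: State, event: str) -> tuple[State, list[str]]:
    """Return the next state and actions; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, []))

state = State.LISTENING
log = []
for event in ["turn_committed", "first_audio_ready", "user_speech_start"]:
    state, actions = step(state, event)
    log.append((event, state.value, actions))

for entry in log:
    print(entry)
```

Keeping the transitions in one table makes it easier to review that every path out of `SPEAKING` on user speech stops playout before the next model call is built.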
Design reviews often ask you to compare two design families: cascaded STT -> LLM -> TTS pipelines and native speech-to-speech models. Early research systems such as AudioPaLM and SpeechGPT showed the idea was viable.[2][3] As of May 28, 2026, native audio is a production API option: OpenAI documents gpt-realtime for realtime audio input and output,[11] and Google documents Gemini Live native-audio sessions with voice activity detection and tool use.[12] Native models reduce model handoffs and can retain prosodic signals that a text-only boundary loses, but they do not remove transport, VAD, AEC, or state-management problems.
Use the comparison as a control checklist. Native audio can shorten the model path, but you still need stateful sessions, transcript side channels, tool policy, and interruption handling.
Many commercial native-audio APIs use stateful sessions, not one-shot request/response calls. The client streams audio frames in, then receives incremental audio and transcript events back.
```python
from collections import defaultdict
import json

events = [
    {"type": "transcript_delta", "text": "Your order"},
    {"type": "audio_delta", "audio_ms": 160},
    {"type": "transcript_delta", "text": " is delayed."},
    {"type": "audio_delta", "audio_ms": 220},
    {"type": "tool_call", "name": "send_tracking_link"},
]

state = defaultdict(list)
# Counts audio received in this simplified loop; real playout must be confirmed by the client.
played_audio_ms = 0

for event in events:
    if event["type"] == "audio_delta":
        played_audio_ms += event["audio_ms"]
    elif event["type"] == "transcript_delta":
        state["transcript"].append(event["text"])
    elif event["type"] == "tool_call":
        state["tool_calls"].append(event["name"])

print(json.dumps({
    "transcript_side_channel": "".join(state["transcript"]),
    "played_audio_ms": played_audio_ms,
    "tool_calls": state["tool_calls"],
}, indent=2))
```

```json
{
  "transcript_side_channel": "Your order is delayed.",
  "played_audio_ms": 380,
  "tool_calls": [
    "send_tracking_link"
  ]
}
```

Many native audio systems avoid feeding raw PCM directly into the transformer. They first compress audio into acoustic tokens or other learned latent representations, then run the sequence model over those shorter units.
Generation works in reverse: the model emits audio tokens or latents, a decoder reconstructs waveform chunks, and the client plays them with a jitter buffer. System-design point: compressed speech representations make sequence modeling practical while preserving tone, pacing, and other paralinguistic cues.
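A rough calculation shows why compression matters for sequence length. The 16 kHz sample rate, 50 frames per second, and 4 codebooks below are illustrative assumptions, not the parameters of any specific codec or model.

```python
def sequence_lengths(seconds: float, sample_rate_hz: int = 16_000,
                     frames_per_second: int = 50, codebooks: int = 4) -> dict:
    """Compare raw PCM sample count with an assumed codec-token count."""
    raw_samples = round(seconds * sample_rate_hz)
    codec_tokens = round(seconds * frames_per_second) * codebooks
    return {
        "raw_pcm_samples": raw_samples,
        "codec_tokens": codec_tokens,
        "reduction_factor": raw_samples / codec_tokens,
    }

lengths = sequence_lengths(seconds=10)
print(lengths)
```

Under these assumptions, ten seconds of speech shrinks from 160,000 samples to 2,000 tokens, which is the kind of reduction that makes transformer-style sequence modeling tractable.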
Native audio can avoid some model handoffs and retain paralinguistic features such as tone, hesitation, and emphasis that a strict text bottleneck removes. Whether it produces faster or better interactions is an evaluation result, not an architectural guarantee.
However, native audio is harder to debug and harder to control. Without a clean intermediate transcript, it's tougher to inspect model failure modes, run deterministic tool policies, or align partially spoken output with conversation history. Modular pipelines are still easier to observe, test, and swap component-by-component.
When moving live audio across the internet, transport choice sets baseline latency and jitter. For browser and mobile voice agents, WebRTC prefers a media-timed UDP path when available, but ICE/TURN deployments need relay and TCP/TLS fallback paths for restricted networks.[4] WebSockets (persistent, full-duplex TCP connections) remain useful for signaling, transcription-only streams, or server-to-server audio when you are willing to manage media buffering yourself.
TCP provides reliable, in-order delivery by retransmitting lost segments. That's great for text and file transfer, but it creates head-of-line blocking for live audio. If one packet is lost, later packets wait behind the retransmission. In voice, that often shows up as stuttery playout or latency spikes in the hundreds of milliseconds.
On a WebRTC UDP media path, loss handling is deadline-aware: the stream can keep moving while packet-loss concealment fills short gaps. For conversation, concealing a short lost frame may be preferable to delaying later audio for recovery; verify this trade-off under the network slices you serve.
The difference is a deadline. RTP recovery helps only while a packet can still improve playout; TCP recovery must preserve byte order even if the voice moment has already passed.
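The deadline difference can be simulated with simple arithmetic. The 20ms packet interval, 40ms one-way delay, 200ms retransmission penalty, and 40ms playout delay are made-up values for illustration.

```python
def arrival_times(n_packets, interval_ms=20, one_way_ms=40, lost=None, retransmit_ms=200):
    """Arrival time per packet; a lost packet only arrives via a delayed retransmission."""
    lost = lost or set()
    times = []
    for seq in range(n_packets):
        extra = retransmit_ms if seq in lost else 0
        times.append(seq * interval_ms + one_way_ms + extra)
    return times

def in_order_delivery(arrivals):
    """Byte-stream model: a packet is readable only after every earlier packet."""
    delivered, latest = [], 0
    for t in arrivals:
        latest = max(latest, t)
        delivered.append(latest)
    return delivered

def deadline_delivery(arrivals, interval_ms=20, one_way_ms=40, playout_delay_ms=40):
    """Media-timed model: play a packet that meets its deadline, otherwise conceal it."""
    decisions = []
    for seq, t in enumerate(arrivals):
        deadline = seq * interval_ms + one_way_ms + playout_delay_ms
        decisions.append("played" if t <= deadline else "concealed")
    return decisions

arrivals = arrival_times(6, lost={2})
stream = in_order_delivery(arrivals)
media = deadline_delivery(arrivals)
print("arrivals:          ", arrivals)
print("in-order delivery: ", stream)
print("deadline decisions:", media)
```

In the byte-stream model, packets 3 to 5 arrive on time but are held until the retransmission of packet 2 lands. In the media-timed model, only packet 2 is concealed and the rest play on schedule.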
Network jitter (variance in packet arrival time) is the enemy of smooth audio. A jitter buffer on the receiver side (media server or client) holds incoming packets for a short duration before playback. Around 20-60ms is a common starting range, but the right value depends on network quality and how much delay your UX can tolerate.
Browser WebRTC stacks already include adaptive jitter buffers and packet loss concealment (PLC). NetEQ is a well-known WebRTC receiver implementation of this idea.[13] When packets arrive late or disappear, the receiver can stretch buffered audio, extrapolate from recent audio, or ask the decoder to synthesize concealment frames instead of stalling playout.
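The buffer-depth trade-off can be illustrated with a fixed-delay jitter buffer. This is a simplified sketch: real receivers adapt the delay continuously, and the packet arrival times below are invented.

```python
def playout_schedule(packets, frame_ms=20, buffer_ms=40):
    """Decide per frame whether to play or conceal, given (seq, arrival_ms) pairs."""
    arrival = dict(packets)
    first_seq, last_seq = min(arrival), max(arrival)
    base = arrival[first_seq] + buffer_ms  # playout time of the first frame
    schedule = []
    for seq in range(first_seq, last_seq + 1):
        play_at = base + (seq - first_seq) * frame_ms
        on_time = seq in arrival and arrival[seq] <= play_at
        schedule.append((seq, play_at, "play" if on_time else "conceal"))
    return schedule

# Packet 2 arrives out of order but in time; packet 4 arrives late.
packets = [(0, 100), (1, 118), (3, 150), (2, 165), (4, 230), (5, 201)]

shallow = playout_schedule(packets, buffer_ms=40)
deep = playout_schedule(packets, buffer_ms=60)
print("40ms buffer:", [decision for _, _, decision in shallow])
print("60ms buffer:", [decision for _, _, decision in deep])
```

With a 40ms buffer the late packet 4 is concealed; a 60ms buffer plays every frame but adds 20ms to every response. Which side of that trade is right depends on the measured jitter of the networks being served.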
| Feature | WebRTC Media (UDP Preferred, Fallbacks Possible) | WebSocket (TCP) |
|---|---|---|
| Transport | UDP preferred; TURN relay over TCP/TLS as fallback | TCP |
| Reliability Model | Media-timed (may use NACK/FEC, but playout stays time-bounded) | Byte-stream reliable (lost bytes block later bytes) |
| Latency Behavior | Low and stable when tuned | Can spike under loss |
| Congestion Control | Media-aware adaptation via RTP/RTCP feedback (implementation-dependent) | Managed by the TCP stack, not by media playout deadlines |
| Encryption | Mandatory: DTLS (Datagram Transport Layer Security) key exchange with SRTP (Secure Real-time Transport Protocol) media | TLS when served over wss:// |
| Ideal Use Case | Real-time Voice/Video | Chat, Signaling, File Transfer |
WebRTC media is richer than "UDP means drop packets." On UDP media paths it layers RTP (Real-time Transport Protocol) / RTCP (RTP Control Protocol) feedback, jitter buffers, packet loss concealment (PLC), and sometimes NACK (Negative Acknowledgment) or FEC (Forward Error Correction). ICE and TURN can select relay or fallback paths when direct UDP is unavailable.[4] The design goal is timeliness: recover loss when it can still help, otherwise keep audio moving.
Using WebSockets as a drop-in replacement for WebRTC in browser live audio creates hidden work. You can ship with WebSockets, but you must own chunking, playout buffering, and backpressure, and accept worse behavior under packet loss.
Designing voice AI for scale introduces challenges around state, media routing, and bursty GPU demand. For a system serving 10,000+ concurrent sessions, the ingress tier usually needs sticky routing or consistent hashing so packets for one conversation keep landing on the same media worker.
The media tier scales by active sessions and network state. The inference tier scales by model pressure, which is why separating these tiers usually beats one giant "voice server" process.
WebRTC connections are deeply stateful. A specific media worker holds Datagram Transport Layer Security (DTLS) keys, Secure Real-time Transport Protocol (SRTP) state, jitter-buffer state, and often playback cursor. If a naive round-robin load balancer routes packets to a different worker mid-session, decryption and timing state break immediately.
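For placing new sessions, a consistent-hash ring limits how many sessions move when the worker pool changes. The sketch below is a minimal illustration with hypothetical worker names and an arbitrary virtual-node count, not a production router.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, workers, vnodes=64):
        # Each worker owns many points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def route(self, session_id: str) -> str:
        # Walk clockwise to the first point at or after the session hash.
        index = bisect.bisect_right(self._keys, _hash(session_id)) % len(self._keys)
        return self._points[index][1]

sessions = [f"call-{n}" for n in range(500)]
before = HashRing(["media-a", "media-b", "media-c"])
after = HashRing(["media-a", "media-b"])  # media-c removed from the pool

moved = [s for s in sessions if before.route(s) != after.route(s)]
moved_from_survivors = [s for s in moved if before.route(s) != "media-c"]
print(f"{len(moved)} of {len(sessions)} sessions remapped; "
      f"{len(moved_from_survivors)} of those were on surviving workers")
```

Only sessions that lived on the removed worker are remapped. A live call on a failed worker still has to reconnect and rebuild its DTLS and SRTP state; the ring only decides where it lands.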
Behind media workers, inference components should scale independently because their compute profiles differ. Turn detection and some STT stages often fit on CPU. The largest speech models and LLMs usually need GPUs. Independent scaling keeps GPU utilization high without overprovisioning lighter tiers.
```python
import hashlib
import json

workers = ["media-a", "media-b", "media-c"]

def route(session_id: str) -> str:
    # Modulo hashing pins a session to one worker while the pool is fixed.
    # It is not consistent hashing: resizing the pool remaps most sessions.
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return workers[digest % len(workers)]

session_id = "call-alex-4219"
packet_routes = [route(session_id) for _ in range(4)]
failed_worker = packet_routes[0]

print(json.dumps({
    "packet_routes": packet_routes,
    "single_worker_during_session": len(set(packet_routes)) == 1,
    "failed_worker": failed_worker,
    "recovery": "reconnect_and_rebuild_session",
}, indent=2))
```

```json
{
  "packet_routes": [
    "media-c",
    "media-c",
    "media-c",
    "media-c"
  ],
  "single_worker_during_session": true,
  "failed_worker": "media-c",
  "recovery": "reconnect_and_rebuild_session"
}
```

At this point you should be able to defend the full live loop: how first audio arrives fast, how barge-in trims history to heard audio, and why media routing must stay stateful even when model pools scale independently.
These are the failure patterns that most often break the voice illusion. In a design review, name the symptom, cause, and fix instead of blaming the model generically.
Symptom: The agent pauses for one or two seconds before it even starts thinking. Cause: You batch-transcribed the whole utterance instead of using streaming STT partials. Fix: Emit stable early text so the LLM can pre-read the turn before the final transcript lands.
Symptom: The answer is correct, but the long silence before speech makes the system feel dead. Cause: You waited for full LLM completion before sending anything to TTS. Fix: Buffer text into short speakable clauses and flush them as soon as they are stable enough to say aloud.
Symptom: The agent speaks an incorrect order number or starts an action the caller immediately corrects. Cause: You treated an unstable STT partial as a committed turn. Fix: Limit partial-transcript work to reversible preparation; authorize speech and side effects only after turn commitment and required confirmation.
Symptom: Slow speakers get clipped, or every caller feels like they are talking to a laggy robot. Cause: Endpointing is mistuned for the call type: a silence threshold that is too short cuts off callers who pause mid-thought, and one that is too long adds dead air to every turn. Fix: Tune silence thresholds and semantic end-of-turn rules against false cutoffs, barge-in latency, and time-to-first-audio.
Symptom: After an interruption, the agent answers as if the user heard words that never played. Cause: You stored generated text instead of playback-confirmed text. Fix: Align synthesized chunks to playout timestamps or RTP ranges and keep only the heard prefix in history.
Symptom: Browser audio works in demos but stalls badly when the network drops packets. Cause: You treated WebSockets as a drop-in replacement for WebRTC media. Fix: Prefer WebRTC for live browser audio, or explicitly own buffering, backpressure, and loss behavior yourself.
Symptom: The agent sounds clever after several seconds, but conversation feels broken. Cause: You put deep reasoning or heavy tool latency directly in the live voice loop. Fix: Keep the live model fast and push expensive reasoning into async tools, follow-up workflows, or background actions.
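The playback-confirmed history fix from the list above can be sketched in a few lines. This is an illustrative version under stated assumptions: the `SpokenChunk` type, its millisecond timestamps, and the policy of discarding a partially played chunk are all hypothetical choices, not a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpokenChunk:
    text: str
    start_ms: int   # playout start, relative to the start of the reply
    end_ms: int     # playout end, relative to the start of the reply


def heard_prefix(chunks, interrupted_at_ms: int) -> str:
    """Return text for chunks whose playout finished before the interruption."""
    # Conservative policy (assumption): a chunk cut off mid-playout
    # counts as not heard.
    heard = [c.text for c in chunks if c.end_ms <= interrupted_at_ms]
    return " ".join(heard)


reply = [
    SpokenChunk("Your order shipped on Tuesday,", 0, 1400),
    SpokenChunk("and it should arrive Friday,", 1400, 2700),
    SpokenChunk("but weather may delay it.", 2700, 4000),
]

# The caller barges in 1.9 seconds into the reply.
history_text = heard_prefix(reply, interrupted_at_ms=1900)
print(history_text)
```

Only the first clause enters conversation history, so the next turn cannot assume the caller heard the Friday estimate or the weather warning.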
The best way to internalize this architecture is to build something small and measurable. Here are three progressive projects you can add to a portfolio:
Voice Mirror: A simple agent that transcribes what you say and speaks it back in a different voice. This teaches you how to wire VAD, STT, and TTS together without the complexity of LLM reasoning.
Smart Store Controller: A voice agent that can look up mock order status and initiate returns via function calls. This adds tool use and teaches you how latency changes when the LLM must call an API before responding.
Roleplay Support Agent: A voice agent that uses a clause buffer and records TTFA distributions while giving multi-sentence responses about delivery exceptions or refund policies. This teaches you to tune silence_duration_ms, clause flushing, and barge-in handling against a declared objective.
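For the third project, a clause buffer can start as small as the sketch below. The `ClauseBuffer` class, the punctuation set, and the `min_chars` guard are assumptions to tune, not a standard API. Punctuation-based splitting will also misfire on inputs such as numbers with thousands separators, so treat it as a starting point.

```python
import re

# Punctuation that usually marks a speakable boundary (an assumption to tune).
CLAUSE_END = re.compile(r"[.!?;:,]\s*$")


class ClauseBuffer:
    """Accumulates streamed LLM tokens and releases short speakable clauses."""

    def __init__(self, min_chars: int = 12) -> None:
        self.min_chars = min_chars      # avoid flushing tiny fragments to TTS
        self._parts = []

    def feed(self, token: str):
        """Add one token; return a clause when one is ready, else None."""
        self._parts.append(token)
        text = "".join(self._parts).strip()
        if len(text) >= self.min_chars and CLAUSE_END.search(text):
            self._parts.clear()
            return text
        return None

    def flush(self):
        """Release whatever remains at the end of the LLM stream."""
        text = "".join(self._parts).strip()
        self._parts.clear()
        return text or None


tokens = ["Your", " order", " shipped", " Tuesday", ",",
          " and", " it", " arrives", " Friday", "."]
buffer = ClauseBuffer()
clauses = [clause for t in tokens if (clause := buffer.feed(t))]
leftover = buffer.flush()
print(clauses, leftover)
```

Logging a timestamp at the moment the first clause is released gives a natural anchor for the TTFA distribution the project asks you to record.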
Building a voice agent with a tight first-audio objective requires measuring every handoff across the processing pipeline. The architecture must be resilient to network jitter while handling the unpredictability of human conversation.
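Measuring every handoff can begin with a simple per-stage budget. The stage names and millisecond values below are made-up placeholders for illustration, not measurements from any system; replace them with timings from your own traces.

```python
from itertools import accumulate

# Hypothetical per-stage latencies in milliseconds for one turn.
stage_ms = {
    "endpointing": 300,
    "stt_final": 120,
    "llm_first_token": 250,
    "clause_buffer": 80,
    "tts_first_audio": 150,
    "network_playout": 60,
}

# Running total after each handoff, in pipeline order.
cumulative = list(zip(stage_ms, accumulate(stage_ms.values())))
total_ms = cumulative[-1][1]
bottleneck = max(stage_ms, key=stage_ms.get)

for name, elapsed in cumulative:
    print(f"{name:<18}{elapsed:>5} ms")
print(f"time to first audio: {total_ms} ms, largest stage: {bottleneck}")
```

With these placeholder numbers, endpointing is the largest single contributor, so shaving the silence threshold would help more than a faster TTS engine.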
You should now be able to diagram the lifecycle of a single voice packet from microphone to model and back, calculate the cumulative latency of a pipeline and identify the bottleneck, implement basic barge-in logic, and discuss the trade-offs between cascaded and native speech-to-speech models.
[1] Universals and cultural variation in turn-taking in conversation. Stivers, T., et al. · 2009 · PNAS
[2] AudioPaLM: A Large Language Model That Can Speak and Listen. Rubenstein, P. K., et al. · 2023 · arXiv preprint
[3] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Zhang, D., et al. · 2023 · arXiv preprint
[4] Media Transport and Use of RTP in WebRTC. Perkins, C., Westerlund, M., & Ott, J. · 2021 · IETF RFC 8834
[5] Turn Detection for Voice Agents: VAD, Endpointing, and Model-Based Detection. Hall, J. (LiveKit) · 2026
[6] Silero VAD: pre-trained enterprise-grade Voice Activity Detector. Silero Team · 2021
[7] Media Capture and Streams. World Wide Web Consortium · 2026 · W3C Editor's Draft
[8] Whisper: Robust Speech Recognition via Large-Scale Weak Supervision. Radford, A., et al. · 2022 · arXiv preprint
[9] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
[10] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023
[11] Introducing gpt-realtime and Realtime API updates for production voice agents. OpenAI · 2025
[12] Gemini Live API overview. Google · 2026
[13] NetEq. WebRTC Project · 2026 · Chromium WebRTC documentation