Real-Time Voice AI Agent

Master real-time voice AI architecture: turn detection, streaming STT/LLM/TTS, native audio trade-offs, WebRTC transport, and barge-in state.

Diffusion Models & Image Generation focused on iterative generation under latency and quality trade-offs. Voice agents bring that pressure into live conversation: audio must stream, tools must resolve, and playback must stay interruptible.

A real-time voice AI agent has to listen, decide, speak, and recover quickly enough that the conversation still feels live. This design chapter covers speech pipelines, turn-taking, latency, tools, and fallback behavior.

An incident hotline feels broken if a user asks, "What is incident four two one nine?" and the voice on the other end pauses for two full seconds before answering. In those two seconds, the caller already wonders if the call dropped, or if the system heard the question at all. That silence kills the illusion that you're talking to something intelligent.

Human conversation runs on split-second timing. Studies of turn-taking show that people minimize silence and overlap across languages, with response timing varying by only hundreds of milliseconds across communities [1] (Universals and cultural variation in turn-taking in conversation, https://www.mpi.nl/publications/item66202/universals-and-cultural-variation-turn-taking-conversation). If a voice AI agent takes much longer than that, the conversation feels broken, no matter how smart the underlying model is. Building a system that listens, thinks, and speaks fast enough to feel natural is one of the hardest engineering problems in modern AI.

The concrete design is an incident-status assistant for an operations team. Trace one audio packet from the user's mouth to the model path and back to the speaker, then watch how the system recovers when the user interrupts mid-sentence. The same architecture choices apply to real-time voice products from IT help desks to emergency operations.

The latency illusion

The primary difference between a text-based chatbot and a voice agent is the continuous flow of time. A text interface lets the user wait patiently while a spinner animates. In a voice interface, silence is a signal. If the agent takes too long to reply, the user assumes it didn't hear them and will likely repeat themselves, causing confusion and overlapping speech.

A useful product target for an interactive support call is a measured time-to-first-audio (TTFA) budget around 500ms for the network and device slices you intend to support. Treat that as a service objective to validate, not a universal boundary: endpoint policy, tools, model choice, and network conditions move the distribution. Research systems such as AudioPaLM [2] (AudioPaLM: A Large Language Model That Can Speak and Listen, https://arxiv.org/abs/2306.12925) and SpeechGPT [3] (SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities, https://arxiv.org/abs/2305.11000) show where speech-native models can go, while cascaded or hybrid pipelines remain useful when transcript inspection, debugging, and tool control matter.

A live routing line

Picture the problem before the architecture details. Building a voice agent is like running a live routing line where each component must hand off partial work without waiting for the full case to finish:

Voice Activity Detection (VAD): Opens and closes the intake gate when speech starts and ends.
Speech-to-Text (STT): Converts the incoming audio stream into partial text.
LLM: Uses stable transcript segments to plan the response.
Text-to-Speech (TTS): Starts speaking as soon as enough response text is available.
Orchestrator: Coordinates handoffs, interruptions, and latency budgets.

If any stage drops packets or runs too slowly, the conversation breaks. The user experiences that as silence, overlap, or a response to a sentence they already corrected.

One packet through the incident-status agent

An engineer named Alex calls an incident hotline and says, "What is incident number four-two-one-nine?"

The audio passes through four steps before Alex hears a response:

Listen (Voice Activity Detection): The system detects when Alex starts speaking and when Alex stops. VAD gates speech boundaries; a separate noise-suppression stage reduces noise mixed with speech.
Transcribe (Speech-to-Text): The audio is converted to text. This must happen while Alex is still speaking (streaming), rather than waiting until Alex finishes.
Think (Large Language Model): The model may prepare from stable partial transcripts, but the orchestrator must not speak a factual answer or execute a side effect until turn-commit policy accepts the input. Once committed, it streams the answer in speakable chunks.
Speak (Text-to-Speech): The voice synthesis engine reads the first few words as soon as they arrive from the LLM, without waiting for the full sentence.

Each of these steps gets its own detailed look below. First, set the latency budget so you know what "fast enough" means in practice.

The latency budget: a worked example

Designing a voice AI agent requires balancing functional capabilities with strict latency constraints. The system needs to understand and generate natural language fast enough to mimic human turn-taking.

Functional requirements

Real-time bidirectional conversation: Near-full-duplex experience where the user can interrupt at any time.
Natural language understanding: Processing speech input into actionable intent.
LLM reasoning: Generating context-aware responses and executing tools (incident lookup, severity summary).
Natural speech synthesis: Generating audio with human-like prosody and emotion.
Interruption handling (barge-in): The agent must stop speaking immediately when the user interrupts.

Example service objectives

Speech-end to first audio: Track p50 and p95 from the acoustic end of user speech to first played assistant audio. This is the user-visible end-to-end clock; an operations line might start with a p95 target of 500ms on declared network slices.
Commit to first audio: Track the internal clock from accepted turn commitment to first played audio. A route might start with a p95 target of 400ms. This excludes endpointing delay and makes post-commit LLM, TTS, and playout regressions easier to isolate.
Mute on barge-in: Track user-speech onset to muted playback; a product might set a p95 target of 100ms.
Concurrency: Size media and inference tiers from forecast active sessions and measured per-session work, rather than assuming a round figure such as 10k+ concurrent sessions.
Audio path: Negotiate supported codecs and sample rates, then measure jitter, loss, and intelligibility by device class.
Availability: Choose an error budget for the call product and include media-connect, model, and tool failure modes.
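
Because these objectives are stated as percentiles, a report has to compute them from per-call samples rather than from one trace. The sketch below is one minimal way to do that, assuming a nearest-rank percentile. The sample values, metric names, and the `percentile` helper are illustrative assumptions; the 500ms, 400ms, and 100ms figures are the example targets from the list.

```python
import json
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-call measurements in milliseconds (made up for illustration).
measurements = {
    "speech_end_to_first_audio": [380, 410, 395, 460, 520, 430, 405, 390, 445, 470],
    "commit_to_first_audio": [320, 350, 335, 400, 455, 370, 345, 330, 385, 410],
    "barge_in_mute": [40, 55, 48, 62, 95, 51, 45, 70, 58, 66],
}
# Example p95 targets taken from the objectives list.
p95_targets_ms = {
    "speech_end_to_first_audio": 500,
    "commit_to_first_audio": 400,
    "barge_in_mute": 100,
}

report = {}
for name, samples in measurements.items():
    p95 = percentile(samples, 95)
    report[name] = {
        "p50": percentile(samples, 50),
        "p95": p95,
        "meets_p95_target": p95 <= p95_targets_ms[name],
    }

print(json.dumps(report, indent=2))
```

With these made-up samples, the mute path meets its target while both first-audio clocks miss at p95 even though their medians look healthy. That gap is why a single good trace can't stand in for the distribution.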

Latency budget breakdown

Use 500ms speech-end TTFA and 400ms commit-to-audio as example targets, not laws of physics. A deployed route must report both distributions rather than promise one illustrative trace.

Figure: Voice latency timeline anchored at acoustic speech end, with turn commit at 60 milliseconds, streaming STT already active, first audio at 380 milliseconds, and separate 380-millisecond speech-end and 320-millisecond commit clocks. Caption: The user clock starts at acoustic speech end; the internal clock starts later at turn commit. Streaming work overlaps both.

Trace the ranges from one shared event: acoustic speech end at 0ms. Streaming STT has already been consuming frames during the utterance, so only its finalization tail remains after speech ends. Read-only preparation may also be warm before commit. TTS starts only after an accepted turn yields a stable speakable clause.

Keep the two clocks separate. Speech-end TTFA includes endpointing and commit delay. Commit-to-audio starts at the accepted boundary and measures the internal response path. Streaming STT begins during speech rather than waiting for either clock.

Alex's incident-status query shows the timing path with concrete numbers:

Stage	Position relative to acoustic speech end	What's happening
Streaming STT partials	Begin during speech, often before `0ms`	Decoder emits "What is incident" while Alex is still talking
Turn commit	`0-60ms`	Endpoint policy accepts that Alex's turn is complete
STT finalization tail	`0-100ms`	Final transcript catches up with audio already streamed during speech
First accepted LLM clause	`60-240ms`	Post-commit generation emits a stable clause; reversible preparation may have started earlier
First TTS chunk	`210-345ms`	Synthesizer turns the stable clause into playable audio
Playout buffer	`345-380ms`	Jitter buffer smooths network variance before the speaker

If the post-speech spans ran one after another, their durations would add to about 510ms. In practice:

STT started decoding before Alex finished the word "incident," so its full decoding cost isn't added after speech end.
The orchestrator can pre-classify intent or start a cancellable read-only lookup from stable words; it can't commit an incident number, speak a result, or perform a side effect from an unstable transcript.
After turn commitment, TTS starts synthesizing as soon as the LLM produces a stable clause like "I found that incident."

In the candidate trace below, first audio plays 380ms after acoustic speech end and 320ms after turn commit. The production decision depends on measured p50/p95 for both clocks, incorrect-commit rate, endpointing aggressiveness, model size, and network jitter.

accounting-for-overlapped-first-audio.py

import json

# One measured trace, in milliseconds after acoustic speech end.
spans = {
    "endpoint_commit": (0, 60),
    "stt_finalization": (0, 100),
    "llm_first_clause": (60, 240),
    "tts_first_chunk": (210, 345),
    "playout_buffer": (345, 380),
}
speech_end_ms = 0
turn_commit_ms = spans["endpoint_commit"][1]
first_audio_ms = spans["playout_buffer"][1]
speech_end_target_ms = 500
commit_target_ms = 400

print(json.dumps({
    "sequential_sum_ms": sum(end - start for start, end in spans.values()),
    "speech_end_to_first_audio_ms": first_audio_ms - speech_end_ms,
    "commit_to_first_audio_ms": first_audio_ms - turn_commit_ms,
    "speech_end_target_ms": speech_end_target_ms,
    "commit_target_ms": commit_target_ms,
    "meets_speech_end_trace_target": first_audio_ms - speech_end_ms <= speech_end_target_ms,
    "meets_commit_trace_target": first_audio_ms - turn_commit_ms <= commit_target_ms,
    "ship_decision_requires_p95": True,
}, indent=2))

Output

{
  "sequential_sum_ms": 510,
  "speech_end_to_first_audio_ms": 380,
  "commit_to_first_audio_ms": 320,
  "speech_end_target_ms": 500,
  "commit_target_ms": 400,
  "meets_speech_end_trace_target": true,
  "meets_commit_trace_target": true,
  "ship_decision_requires_p95": true
}

How the pieces fit together

The architecture usually separates three concerns: client-side capture and playback, a latency-sensitive media edge, and the inference pipeline. Decoupling transport from inference matters because audio delivery has different failure modes from model execution. Packet loss, jitter, and echo cancellation must be handled on the media path even if the model stack is healthy.

In many deployments, a media gateway terminates WebRTC (Web Real-Time Communication) or SIP (Session Initiation Protocol), handles jitter buffers and routing, then forwards audio to inference over an internal low-latency link [4] (Media Transport and Use of RTP in WebRTC, https://www.rfc-editor.org/info/rfc8834/). Not every system needs a full SFU (Selective Forwarding Unit) for a 1:1 assistant, but most serious deployments still benefit from a dedicated ingress tier close to the user.

Within inference, the voice pipeline orchestrates turn detection, STT or native-audio encoding, LLM reasoning, tool calls, and TTS or audio decoding. A conversation state manager sits alongside the pipeline, keeping track of dialogue turns, playback offsets, and tool results so the model only commits to what the user heard.

Figure: Real-time voice agent where one hot audio path runs from mic to playback while separate interruption and truth loops decide what to mute now and what counts as truly heard. Caption: Voice agents feel fast when the speech path stays short and reliable, while interruption control and truth reconciliation stay out of that hot loop.

Follow the flow from left to right: capture, media edge, inference, and back to the speaker. Notice the barge-in path that instantly stops playback when the user interrupts.

Knowing when to listen: voice activity detection

Voice Activity Detection (VAD) determines when the user starts and stops speaking. In Alex's incident-status call, VAD is what tells the system, "Alex is done asking the question. You can start answering now."

VAD isn't the same thing as turn detection. VAD classifies each audio frame as speech or silence. Turn detection (endpointing) decides when the user's turn has finished so the agent can answer. Modern stacks increasingly layer three signals: VAD for raw speech presence, transcript-level endpointing for silence after stable words, and a model that judges semantic completeness from the partial transcript. Model-based end-of-turn detectors can commit a turn before trailing silence accumulates, cutting both clipped speakers and laggy responses [5] (Turn Detection for Voice Agents: VAD, Endpointing, and Model-Based Detection, https://livekit.com/blog/turn-detection-voice-agents-vad-endpointing-model-based-detection). You still keep a fast VAD underneath for instant barge-in.

Client-side VAD is common because it gives instant barge-in detection and avoids sending obvious silence. Server-side turn detection is also common because it centralizes tuning and can use richer context. The class below is a local stand-in for a streaming VAD backend such as Silero VAD [6] (Silero VAD: pre-trained enterprise-grade Voice Activity Detector, https://github.com/snakers4/silero-vad): it consumes one precomputed speech probability per frame instead of running a model.

knowing-when-to-listen-voice-activity.py

from collections import deque
from dataclasses import asdict, dataclass
import json

@dataclass
class VADEvent:
    type: str
    frame: int
    audio_ms: int

class VoiceActivityDetector:
    """Detect speech boundaries from one speech probability per audio frame (20ms by default)."""

    def __init__(
        self,
        frame_ms: int = 20,
        speech_threshold: float = 0.5,
        silence_duration_ms: int = 60,
        prefix_padding_ms: int = 40,
    ):
        self.frame_ms = frame_ms
        self.speech_threshold = speech_threshold
        self.silence_frames_required = max(1, silence_duration_ms // frame_ms)
        self.pre_roll = deque(maxlen=max(1, prefix_padding_ms // frame_ms))
        self.speech_frames: list[int] = []
        self.in_speech = False
        self.trailing_silence = 0

    def process_frame(self, frame_id: int, probability: float) -> VADEvent | None:
        """Return speech boundary events for one audio frame."""
        if probability >= self.speech_threshold:
            if not self.in_speech:
                self.in_speech = True
                self.speech_frames = list(self.pre_roll)
                self.speech_frames.append(frame_id)
                self.trailing_silence = 0
                return VADEvent("speech_start", frame_id, len(self.speech_frames) * self.frame_ms)
            self.speech_frames.append(frame_id)
            self.trailing_silence = 0
            return None

        if not self.in_speech:
            self.pre_roll.append(frame_id)
            return None

        self.speech_frames.append(frame_id)
        self.trailing_silence += 1
        if self.trailing_silence < self.silence_frames_required:
            return None

        utterance_frames = self.speech_frames[:-self.trailing_silence]
        self.in_speech = False
        self.speech_frames = []
        self.trailing_silence = 0
        self.pre_roll.clear()
        return VADEvent("speech_end", frame_id, len(utterance_frames) * self.frame_ms)

probabilities = [0.04, 0.08, 0.66, 0.74, 0.69, 0.22, 0.15, 0.09]
detector = VoiceActivityDetector()
events = []

for frame_id, probability in enumerate(probabilities):
    event = detector.process_frame(frame_id, probability)
    if event:
        events.append(asdict(event))

print(json.dumps(events, indent=2))

Output

[
  {
    "type": "speech_start",
    "frame": 2,
    "audio_ms": 60
  },
  {
    "type": "speech_end",
    "frame": 7,
    "audio_ms": 100
  }
]

The demo uses a short silence_duration_ms = 60 so its output fits in eight frames. A phone support bot might begin testing around 300ms, then tune from measured false cutoffs and lag. Slow speakers may need longer thresholds; push-to-talk or command-style assistants can go shorter. A fixed silence threshold is the floor, not the ceiling: a semantic end-of-turn model can hold through mid-sentence pauses (a caller reciting a phone number digit by digit) and fire early once the utterance is complete, so you don't have to pick one silence value that hurts every other case [5].
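
One way to picture a semantic end-of-turn signal working alongside a silence timer is to let the required silence shrink as the partial transcript looks more complete. The sketch below assumes a hypothetical `completeness` score in [0, 1] from an end-of-turn model; the 120ms and 900ms bounds and the linear interpolation are illustrative choices, not values from any particular detector.

```python
def required_silence_ms(completeness: float, min_ms: int = 120, max_ms: int = 900) -> int:
    """Interpolate the silence wait: complete-sounding text waits less."""
    clamped = min(1.0, max(0.0, completeness))
    return round(max_ms - clamped * (max_ms - min_ms))

def should_commit(trailing_silence_ms: int, completeness: float) -> bool:
    """Commit the turn once trailing silence covers the score-dependent wait."""
    return trailing_silence_ms >= required_silence_ms(completeness)

# Hypothetical scores: a half-recited number looks incomplete, a full question does not.
cases = [
    ("My number is four one five", 0.10, 300),
    ("What is incident four two one nine?", 0.95, 200),
]
for text, completeness, silence_ms in cases:
    decision = "commit" if should_commit(silence_ms, completeness) else "hold"
    print(f"{decision}: {text!r} after {silence_ms}ms of silence")
```

The caller reciting digits is held through a 300ms pause, while the finished question commits after 200ms. A production policy would also cap the total wait and log each decision for tuning.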

VAD deployment: client vs. server

Choosing where to run VAD changes responsiveness, bandwidth, and operational control. Client-side detection wins on local responsiveness. Server-side detection wins on centralized tuning and consistent turn boundaries across devices.

Feature	Client-Side VAD	Server-Side Detection
Barge-in reaction	Can mute local playout without a network round trip	Includes transport delay before server action reaches playback
Bandwidth	Can suppress silence if upload policy uses its output	Receives the configured upstream stream continuously
Operational control	Requires testing across device/browser classes	Centralizes threshold and model tuning
Turn consistency	Device signals may differ	Can apply one server-side commit policy
Data exposure	Reduced only if capture/upload policy withholds silent frames	Depends on retained audio and server policy

Many production systems use both: lightweight local VAD to mute playback instantly on barge-in, plus server-side turn detection to decide when transcripts are stable enough to trigger response generation.

Figure: Hybrid turn-detection diagram where local VAD mutes playback immediately when user speech starts, server endpointing waits for stable boundary before transcript commit, and disagreements feed tuning logs. Caption: Local VAD handles interruption feel; server endpointing still owns final transcript, tool, and analytics boundaries.

The hybrid setup gives the user the fast interruption path they expect while keeping final turn boundaries consistent enough for transcripts, tools, and analytics.

Acoustic echo cancellation (AEC)

AEC prevents the agent from hearing itself. When the agent speaks and the user interrupts, the microphone picks up both the user's voice and the agent's own output playing from the speaker. Without AEC, the VAD would trigger on the agent's voice, creating an echo loop where the agent transcribes its own speech and responds to it.

Modern browser capture stacks expose echo-cancellation constraints so applications can request microphone input with system or remote audio removed before it reaches VAD and STT [7] (Media Capture and Streams, https://w3c.github.io/mediacapture-main/). The browser or device owns that processing, and implementations vary. Treat the constraint as a request, then test the resulting audio path across speaker volume, microphone placement, rooms, and device classes.

One common failure is deploying voice agents without testing AEC in real acoustic environments. Speakers and microphones that sit close together, as in laptops and mobile phones, create strong echo that an untested AEC configuration may fail to suppress. The symptom is brutal: the agent interrupts itself.

From sound to text: speech-to-text

Two primary approaches exist for real-time STT. In Alex's call, streaming STT is what lets the system start thinking about "What is incident" before Alex fully finishes saying "incident."

Streaming STT (recommended for latency)

Streaming STT sends partial transcripts as audio arrives, allowing the LLM to pre-read the turn and start cancellable, read-only preparation before the utterance is complete. Provider SDKs differ, but the state machine must separate unstable interim text from committed text that may enter history, trigger tools, or reach TTS.

streaming-stt-recommended-for-latency.py

from dataclasses import asdict, dataclass
import json

@dataclass
class TranscriptEvent:
    kind: str
    text: str
    stable_prefix: str

def common_prefix(left: str, right: str) -> str:
    shared = []
    for left_word, right_word in zip(left.split(), right.split()):
        if left_word != right_word:
            break
        shared.append(left_word)
    return " ".join(shared)

def emit_transcript_events(partials: list[tuple[str, bool]]) -> list[TranscriptEvent]:
    events = []
    previous_text = ""

    for text, is_final in partials:
        # A final transcript is stable in full; an interim one is trusted only
        # up to the words it shares with the previous hypothesis.
        stable_prefix = text if is_final else common_prefix(previous_text, text)
        kind = "final" if is_final else "interim"
        events.append(TranscriptEvent(kind, text, stable_prefix))
        previous_text = text

    return events

partials = [
    ("What is", False),
    ("What is incident", False),
    ("What is incident number four two", False),
    ("What is incident number four two one nine?", True),
]

print(json.dumps([asdict(event) for event in emit_transcript_events(partials)], indent=2))

Output

[
  {
    "kind": "interim",
    "text": "What is",
    "stable_prefix": ""
  },
  {
    "kind": "interim",
    "text": "What is incident",
    "stable_prefix": "What is"
  },
  {
    "kind": "interim",
    "text": "What is incident number four two",
    "stable_prefix": "What is incident"
  },
  {
    "kind": "final",
    "text": "What is incident number four two one nine?",
    "stable_prefix": "What is incident number four two one nine?"
  }
]

The shared-prefix heuristic is intentionally conservative: it only exposes words repeated across two successive hypotheses. Even that prefix is tentative because a provider can revise earlier words later. Use it only for reversible preparation. The commit boundary matters most when a partial transcript changes meaning. An incident lookup may be safe to prefetch and discard; closing an incident or promising mitigation status isn't.

gating-work-on-transcript-commit.py

import json

def permitted_actions(text: str, committed: bool) -> list[str]:
    actions = ["classify_intent"]
    if "incident" in text.lower():
        actions.append("prefetch_read_only_status")
    if committed:
        actions.extend(["write_history", "speak_response"])
        if "close" in text.lower():
            actions.append("ask_for_close_confirmation")
    return actions

events = [
    {"text": "Close incident four two one nine", "committed": False},
    {"text": "Show incident four two one nine", "committed": True},
]

print(json.dumps([
    {"committed": event["committed"], "actions": permitted_actions(**event)}
    for event in events
], indent=2))

Output

[
  {
    "committed": false,
    "actions": [
      "classify_intent",
      "prefetch_read_only_status"
    ]
  },
  {
    "committed": true,
    "actions": [
      "classify_intent",
      "prefetch_read_only_status",
      "write_history",
      "speak_response"
    ]
  }
]

End-of-utterance STT (simpler, slightly higher latency)

Using a batch recognizer such as Whisper [8] (Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, https://arxiv.org/abs/2212.04356) after an endpoint event is much simpler, but it adds the endpoint wait and full decode after the user stops talking.

end-of-utterance-stt-simpler-slightly-higher.py

import json

def batch_stt_timeline(speech_ms: int, endpoint_ms: int, decode_ms: int) -> dict[str, int | str]:
    acoustic_speech_end_ms = speech_ms
    turn_commit_ms = acoustic_speech_end_ms + endpoint_ms
    transcript_ready_ms = turn_commit_ms + decode_ms
    return {
        "approach": "end-of-utterance",
        "speech_ms": speech_ms,
        "endpoint_ms": endpoint_ms,
        "decode_ms": decode_ms,
        "acoustic_speech_end_ms": acoustic_speech_end_ms,
        "turn_commit_ms": turn_commit_ms,
        "first_transcript_ms": transcript_ready_ms,
    }

print(json.dumps(batch_stt_timeline(speech_ms=1200, endpoint_ms=300, decode_ms=450), indent=2))

Output

{
  "approach": "end-of-utterance",
  "speech_ms": 1200,
  "endpoint_ms": 300,
  "decode_ms": 450,
  "acoustic_speech_end_ms": 1200,
  "turn_commit_ms": 1500,
  "first_transcript_ms": 1950
}

Approach	Latency	Complexity
Streaming STT	Illustrative first partial: ~50-150ms	High
End-of-utterance	Illustrative post-end decode: ~200-800ms+	Low

Thinking before the user finishes: LLM reasoning with streaming

After turn commitment, the LLM should emit a stable first speakable clause without waiting for a long response plan. Before commitment, the system may only do reversible preparation. This makes Time-to-First-Token (TTFT), the duration it takes for the LLM to produce its first output token, relevant without allowing an unstable transcript to become spoken truth; the serving mechanics are covered in Inference Mechanics.

Long inference-time reasoning and slow tools are poor candidates for a low-latency acknowledgement path unless measurements meet its TTFA objective. A voice agent can speak an honest acknowledgement quickly, run complex work asynchronously, and deliver the result when ready instead of pretending every question has an instant answer.
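
The acknowledge-then-deliver pattern can be sketched with `asyncio`: start the tool call, wait for it only up to an acknowledgement budget, and speak a short holding phrase if the result isn't ready. The function names, the simulated delays, and the 50ms budget are assumptions for illustration; a real budget would come from the measured TTFA objective.

```python
import asyncio

async def lookup_incident(incident_id: str, delay_s: float) -> str:
    """Stand-in for a slow tool call; the delay is simulated."""
    await asyncio.sleep(delay_s)
    return f"Incident {incident_id} is still open."

async def answer_with_ack(incident_id: str, tool_delay_s: float, ack_budget_s: float = 0.05) -> list[str]:
    """Speak the result directly if it arrives in time; otherwise acknowledge first."""
    spoken: list[str] = []
    task = asyncio.create_task(lookup_incident(incident_id, tool_delay_s))
    # asyncio.wait returns on timeout without cancelling the task.
    done, _pending = await asyncio.wait({task}, timeout=ack_budget_s)
    if not done:
        spoken.append("Let me check that incident.")
    spoken.append(await task)
    return spoken

fast = asyncio.run(answer_with_ack("4219", tool_delay_s=0.0))
slow = asyncio.run(answer_with_ack("4219", tool_delay_s=0.2))
print(fast)
print(slow)
```

The fast path speaks only the answer; the slow path speaks an honest acknowledgement and then the answer once the lookup finishes.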

The helper below shows the voice-specific shape: short clauses, no Markdown, tool status stated early, and a stream of text chunks that TTS can consume before the full response is complete.

thinking-before-the-user-finishes-llm.py

import json

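def voice_chunks(transcript: str, incident_status: dict[str, str]) -> list[str]:
    # `transcript` is accepted to mirror a real LLM call; this stub builds a
    # fixed reply from the tool result instead of conditioning on it.
    reply = (
        f"Incident {incident_status['id']} is still open. "
        f"Severity is {incident_status['severity']}. "
        "I can text you the incident link."
    )
    # Naive split on "." is enough for this reply; it would break on decimals
    # or abbreviations.
    clauses = [part.strip() + "." for part in reply.split(".") if part.strip()]
    return clauses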

transcript = "What is incident number four two one nine?"
chunks = voice_chunks(
    transcript=transcript,
    incident_status={"id": "4219", "severity": "sev-two"},
)

print(json.dumps({
    "contains_incident": "incident" in transcript.lower(),
    "max_words_per_clause": max(len(clause.split()) for clause in chunks),
    "streamed_clauses": chunks,
}, indent=2))

Output

{
  "contains_incident": true,
  "max_words_per_clause": 7,
  "streamed_clauses": [
    "Incident 4219 is still open.",
    "Severity is sev-two.",
    "I can text you the incident link."
  ]
}

Prompt engineering for voice is distinct from text. Voice prompts must explicitly forbid Markdown, which TTS reads poorly, and encourage front-loading the answer so meaningful audio is generated immediately.
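
Prompts reduce Markdown but don't guarantee its absence, so some pipelines also sanitize text before it reaches TTS. The function below is a deliberately narrow sketch of that idea using a few regular expressions. The function name and the set of patterns are assumptions; it handles only headings, list markers, links, inline code, and asterisk emphasis, and it would mangle text that uses those characters literally.

```python
import re

def strip_markdown_for_tts(text: str) -> str:
    """Remove common Markdown markers so TTS does not read them aloud."""
    # Line-leading markers first: headings, bullets, numbered items.
    text = re.sub(r"^[ \t]{0,3}(?:#{1,6}|[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    # Links keep their label and drop the URL.
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Inline code and asterisk emphasis keep their inner text.
    text = re.sub(r"`([^`]*)`", r"\1", text)
    text = re.sub(r"(\*\*|\*)(.+?)\1", r"\2", text)
    # One spoken sentence per source line, so list items do not run together.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return " ".join(line if line[-1] in ".!?,:;" else line + "." for line in lines)

raw = "## Status\n- **Incident 4219** is `open`\n- See [the runbook](https://example.com/runbook)"
spoken = strip_markdown_for_tts(raw)
print(spoken)
```

Stripping list markers before emphasis matters here: a leading `*` bullet would otherwise pair with the first `*` of a bold span.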

Serving optimizations that matter in voice

Live voice sessions stress inference differently from chat. TTFT matters more than long-form throughput because TTS can't speak until the first stable clause arrives. Session length also matters because a 20-minute call can accumulate enough context to make KV-cache management part of the latency budget, not a background infrastructure concern.

Paged KV caches: Long-lived sessions fragment memory if every request expects one contiguous KV-cache region. PagedAttention-style allocators let serving systems grow and trim per-session context without constant copying, which helps keep latency stable across many simultaneous calls [9] (Efficient Memory Management for Large Language Model Serving with PagedAttention, https://arxiv.org/abs/2309.06180).
Speculative decoding: A draft model proposes tokens that the target model verifies in parallel, reducing decode latency when acceptance and implementation overhead are favorable [10] (Fast Inference from Transformers via Speculative Decoding, https://arxiv.org/abs/2211.17192). It isn't an automatic TTFT improvement; benchmark TTFA and total clause latency on the served model pair.
Aggressive context hygiene: Don't keep every interim transcript forever. Preserve final transcripts, tool results, and only the portion of assistant audio the user heard before barge-in.

Speaking without waiting for the full answer: TTS with streaming

Begin speaking before the full response is generated. The TTS engine receives a stream of text tokens, buffers them into sentences or phrases, and synthesizes audio chunks on the fly.

Waiting for the LLM to finish an entire sentence or paragraph before starting TTS can add seconds of latency. Stream tokens and synthesize clause-by-clause instead.

This function demonstrates buffering text into short speakable clauses. Waiting for a full paragraph is too slow, but firing on every token sounds choppy.

speaking-without-waiting-for-the-full-answer.py

import json

def flush_speakable_clauses(tokens: list[str], max_chars: int = 52) -> list[str]:
    buffer = ""
    flushed = []

    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        if stripped.endswith((",", ".", "!", "?")) or len(stripped) >= max_chars:
            flushed.append(stripped)
            buffer = ""

    if buffer.strip():
        flushed.append(buffer.strip())

    return flushed

tokens = [
    "Incident ", "4219 ", "is ", "still ", "open, ",
    "severity ", "is ", "sev-two. ",
    "I ", "can ", "text ", "the ", "incident ", "link."
]

print(json.dumps(flush_speakable_clauses(tokens), indent=2))

Output

[
  "Incident 4219 is still open,",
  "severity is sev-two.",
  "I can text the incident link."
]

How parallel execution reduces latency

The end-to-end target starts at acoustic speech end, while the internal target starts at turn commit. The production pipeline overlaps stages rather than charging every component serially to either clock. A realistic overlap looks like this:

VAD runs continuously, monitoring audio as it arrives and contributing a speech-end signal to endpointing.
Streaming STT is already running during speech. Silence doesn't start recognition; it helps the endpoint policy decide when the current turn may commit.
Reversible LLM preparation can start from stable STT partials, while spoken or side-effecting work waits for turn commitment.
TTS begins synthesizing as soon as the committed response has a stable speakable clause, which may be well before the LLM finishes its full response.
Audio playback begins as soon as the first TTS chunk is ready.

The illustrated trace reaches first audio 380ms after acoustic speech end, inside a rough 300-450ms band for its assumed healthy path. Real TTFA depends on endpointing errors, model size, tool routing, and jitter-buffer depth; report its distribution for target network slices.

Figure: Voice pipeline overlap chart with endpointing, STT, LLM, and speak lanes. Caption: First audio starts before the full response exists.

Overlap, not faster individual stages alone, drives the latency budget. VAD, STT, LLM, TTS, and playout each add latency, but the first audible word can arrive before every downstream job has fully completed.

When the user interrupts: barge-in handling

When the user speaks over the agent ("barge-in"), the system must react instantly to maintain the illusion of a natural conversation. Imagine Alex asks about an incident, the agent starts explaining mitigation status, and Alex interrupts with "Wait, page the owner." The agent must stop immediately.

Stop Playback: Immediately halt the audio output. A support product might begin with a p95 mute target below 100ms, then tune against measured paths.
Cancel Generation: Abort in-flight TTS and LLM requests to save cost and compute.
State Reconciliation: Update the conversation context to reflect only what the user heard.

Barge-in timeline showing heard audio kept in history and unheard generated audio discarded at the interruption boundary. — Barge-in state must reflect what the user heard, not what the model generated.

Stopping the speaker is only half of barge-in. The next model call should remember only the part of the assistant response that crossed the speaker boundary.

The handler below shows conservative reconciliation at audio-chunk boundaries. When VAD detects fresh user speech during agent playback, the system must stop playout, cancel in-flight generation, and truncate assistant text to the portion confirmed through the speaker.

when-the-user-interrupts-barge-in-handling.py

from dataclasses import dataclass
import json

@dataclass
class PlayedChunk:
    text: str
    text_end_char: int  # Exclusive end offset in pending_text.
    playback_end_ms: int

class InterruptionHandler:
    def __init__(self, pending_text: str, played_chunks: list[PlayedChunk]):
        self.pending_text = pending_text
        self.played_chunks = played_chunks

    def reconcile(self, interrupt_ms: int) -> dict[str, object]:
        heard_chars = self._chars_heard(interrupt_ms)
        spoken_text = self.pending_text[:heard_chars].rstrip()
        discarded_text = self.pending_text[heard_chars:].strip()
        return {
            "history_to_keep": [{"role": "assistant", "content": spoken_text}],
            "discarded_generated_text": discarded_text,
        }

    def _chars_heard(self, interrupt_ms: int) -> int:
        heard_chars = 0
        for chunk in self.played_chunks:
            if chunk.playback_end_ms <= interrupt_ms:
                heard_chars = chunk.text_end_char
            else:
                break
        return heard_chars

pending = "Incident is open because the database failover is still running."
chunks = [
    PlayedChunk("Incident is", 11, 240),
    PlayedChunk(" open", 16, 520),
    PlayedChunk(" because the database", 37, 860),
    PlayedChunk(" failover is still running.", 64, 1120),
]

result = InterruptionHandler(pending, chunks).reconcile(interrupt_ms=650)
print(json.dumps(result, indent=2))

Output

{
  "history_to_keep": [
    {
      "role": "assistant",
      "content": "Incident is open"
    }
  ],
  "discarded_generated_text": "because the database failover is still running."
}

This tiny handler keeps only fully played chunks. text_end_char is an exclusive Python slice boundary: offset 16 keeps pending_text[:16], which is "Incident is open". In production, use smaller chunks or map synthesized spans to actual playout timestamps or RTP sequence ranges when you need finer reconciliation. Don't rely on model tokens alone. Jitter buffers can delay or drop chunks after synthesis, so "generated" isn't the same as "heard."
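The reconciliation handler covers the third barge-in step. The first two, stopping playback and cancelling in-flight generation, can be sketched with `asyncio` task cancellation. This is a minimal sketch under stated assumptions: `speak`, `run_turn`, and the timing constants are hypothetical stand-ins for a real synthesis and playout path, and the fixed sleep stands in for a VAD signal.

```python
import asyncio

CHUNK_SECONDS = 0.1      # hypothetical synthesis-plus-playout time per chunk
INTERRUPT_SECONDS = 0.25  # hypothetical moment when VAD flags user speech


async def speak(chunks: list[str], played: list[str]) -> None:
    for chunk in chunks:
        await asyncio.sleep(CHUNK_SECONDS)  # stand-in for synthesizing and playing a chunk
        played.append(chunk)                # record only chunks that finished playing


async def run_turn() -> tuple[list[str], bool]:
    chunks = ["Incident is", " open", " because the database", " failover is still running."]
    played: list[str] = []
    speaking = asyncio.create_task(speak(chunks, played))

    await asyncio.sleep(INTERRUPT_SECONDS)  # user starts talking over the agent
    speaking.cancel()                       # stop playout and abandon remaining chunks
    try:
        await speaking
    except asyncio.CancelledError:
        pass                                # expected: the task was cancelled on purpose
    return played, speaking.cancelled()


played, cancelled = asyncio.run(run_turn())
print(played, cancelled)
```

The `played` list is what a reconciliation step would treat as heard text. In a real system the same cancellation signal would also need to reach the upstream LLM and TTS requests.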

Cascaded pipeline vs. native audio

Design reviews often ask you to compare two design families: cascaded STT -> LLM -> TTS pipelines and native speech-to-speech models. Early research systems such as AudioPaLM and SpeechGPT showed the idea was viable.^{[2]Reference 2AudioPaLM: A Large Language Model That Can Speak and Listen.https://arxiv.org/abs/2306.12925}^{[3]Reference 3SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities.https://arxiv.org/abs/2305.11000} Native audio is now a production API option: OpenAI's Realtime API supports live speech-to-speech sessions,^{[11]Reference 11Introducing gpt-realtime and Realtime API updates for production voice agentshttps://openai.com/index/introducing-gpt-realtime/} and Google documents Gemini Live native-audio sessions with voice activity detection and tool use.^{[12]Reference 12Gemini Live API overviewhttps://ai.google.dev/gemini-api/docs/live-api} Provider model catalogs change quickly, so verify current aliases when you implement the route. Native models reduce model handoffs and can retain prosodic signals that a text-only boundary loses, but they don't remove transport, VAD, AEC, or state-management problems.

Comparison of cascaded audio with explicit STT, LLM, and TTS stages against native audio with one speech model plus separate transcript, tool, and trace side channels. — Cascaded audio keeps text checkpoints inline. Native audio removes model hops, but transcript, policy, and tracing still sit outside the model.

Use the comparison as a control checklist. Native audio can shorten the model path, but you still need stateful sessions, transcript side channels, tool policy, and interruption handling.

Many commercial native-audio APIs use stateful sessions, not one-shot request/response calls. The client streams audio frames in, then receives incremental audio and transcript events back.

Session shape

session-shape.py

from collections import defaultdict
import json

events = [
    {"type": "transcript_delta", "text": "Incident is"},
    {"type": "audio_delta", "audio_ms": 160},
    {"type": "transcript_delta", "text": " still open."},
    {"type": "audio_delta", "audio_ms": 220},
    {"type": "tool_call", "name": "send_incident_link"},
]

state = defaultdict(list)
received_audio_ms = 0  # Audio received from the session; playout must be confirmed separately.

for event in events:
    if event["type"] == "audio_delta":
        received_audio_ms += event["audio_ms"]
    elif event["type"] == "transcript_delta":
        state["transcript"].append(event["text"])
    elif event["type"] == "tool_call":
        state["tool_calls"].append(event["name"])

print(json.dumps({
    "transcript_side_channel": "".join(state["transcript"]),
    "received_audio_ms": received_audio_ms,
    "tool_calls": state["tool_calls"],
}, indent=2))

Output

{
  "transcript_side_channel": "Incident is still open.",
  "received_audio_ms": 380,
  "tool_calls": [
    "send_incident_link"
  ]
}

Audio tokenization

Published native-audio systems commonly avoid feeding raw PCM directly into a transformer. AudioPaLM and SpeechGPT use compressed speech representations rather than modeling every waveform sample as its own sequence element.^{[2]Reference 2AudioPaLM: A Large Language Model That Can Speak and Listen.https://arxiv.org/abs/2306.12925}^{[3]Reference 3SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities.https://arxiv.org/abs/2305.11000}

Generation works in reverse: the model emits audio tokens or latents, a decoder reconstructs waveform chunks, and the client plays them with a jitter buffer. System-design point: compressed speech representations make sequence modeling practical while preserving tone, pacing, and other paralinguistic cues.
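A rough sequence-length comparison shows why compression matters. The frame rate and codebook count below are hypothetical placeholders chosen for illustration; they are not the settings of AudioPaLM, SpeechGPT, or any specific codec.

```python
SAMPLE_RATE_HZ = 16_000   # common speech sample rate
FRAMES_PER_SECOND = 50    # hypothetical codec frame rate
CODEBOOKS_PER_FRAME = 4   # hypothetical number of discrete tokens per frame


def sequence_lengths(seconds: float) -> tuple[int, int]:
    """Return (raw PCM samples, compressed audio tokens) for a clip of this length."""
    raw_samples = int(seconds * SAMPLE_RATE_HZ)
    audio_tokens = int(seconds * FRAMES_PER_SECOND * CODEBOOKS_PER_FRAME)
    return raw_samples, audio_tokens


raw, tokens = sequence_lengths(10.0)
print(raw, tokens, raw // tokens)  # 160000 2000 80
```

Under these assumptions, a ten-second utterance shrinks from 160,000 sequence elements to 2,000, which is the difference between an impractical context and a routine one.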

Trade-offs

Native audio can avoid some model handoffs and retain paralinguistic features such as tone, hesitation, and emphasis that a strict text bottleneck removes. Whether it produces faster or better interactions is an evaluation result, not an architectural guarantee.

However, native audio is harder to debug and harder to control. Without a clean intermediate transcript, it's tougher to inspect model failure modes, run deterministic tool policies, or align partially spoken output with conversation history. Modular pipelines are still easier to observe, test, and swap component-by-component.

Moving audio across the internet

When moving live audio across the internet, transport choice sets baseline latency and jitter. For browser and mobile voice agents, WebRTC prefers a media-timed UDP path when available, but ICE/TURN deployments need relay and TCP/TLS fallback paths for restricted networks (RFC 8835). WebSockets (persistent, full-duplex TCP connections) remain useful for signaling, transcription-only streams, or server-to-server audio when you're willing to manage media buffering yourself.

TCP provides reliable, ordered byte delivery with sequence numbers, acknowledgments, and retransmission. That's great for text and file transfer, but it creates head-of-line blocking for live audio. If one segment is lost, later bytes wait behind recovery. In voice, that often shows up as stuttery playout or latency spikes.

On a WebRTC UDP media path, loss handling is deadline-aware: the stream can keep moving while packet-loss concealment fills short gaps. For conversation, concealing a short lost frame may be preferable to delaying later audio for recovery; verify this trade-off under the network slices you serve.

WebRTC versus WebSocket transport comparison showing later audio packets continuing past one lost UDP frame while ordered TCP bytes stall behind a missing segment. — For live audio, meeting playout deadline matters more than preserving strict byte order.

The deadline changes the protocol choice. RTP recovery helps only while a packet can still improve playout; TCP recovery must preserve byte order even if the voice moment has already passed.

Jitter buffers and packet loss concealment

Network jitter (variance in packet arrival time) is the enemy of smooth audio. A jitter buffer on the receiver side (media server or client) holds incoming packets for a short duration before playback. A team might begin experiments around 20-60ms, but the right value depends on network quality and how much delay your UX can tolerate.

Static Jitter Buffer: Fixed size (e.g., 50ms). Simple but risks underruns if jitter exceeds buffer size.
Adaptive Jitter Buffer: Dynamically resizes based on network conditions. Expands during congestion (adding latency) and shrinks during stability (reducing latency).

Browser WebRTC stacks already include adaptive jitter buffers and packet loss concealment (PLC). NetEQ is a well-known WebRTC receiver implementation of this idea.^{[13]Reference 13NetEqhttps://source.chromium.org/chromium/chromium/src/+/main:third_party/webrtc/modules/audio_coding/neteq/g3doc/index.md} When packets arrive late or disappear, the receiver can stretch buffered audio, extrapolate from recent audio, or ask the decoder to synthesize concealment frames instead of stalling playout.
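For teams that manage buffering themselves, the adaptive idea can be sketched as a target delay that grows with observed jitter. This is an illustrative heuristic, not NetEQ's algorithm: the class name, the window size, the multiplier `k`, and the 20-60ms clamp are assumptions.

```python
from collections import deque
import statistics


class AdaptiveJitterBuffer:
    def __init__(self, min_ms: float = 20.0, max_ms: float = 60.0,
                 window: int = 50, k: float = 2.0):
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.k = k
        self.transits: deque[float] = deque(maxlen=window)  # recent packet transit times

    def observe(self, transit_ms: float) -> None:
        self.transits.append(transit_ms)

    def target_ms(self) -> float:
        if len(self.transits) < 2:
            return self.min_ms
        spread = statistics.pstdev(self.transits)  # jitter estimate over the window
        return min(self.max_ms, max(self.min_ms, self.min_ms + self.k * spread))


def target_for(transits: list[float]) -> float:
    buffer = AdaptiveJitterBuffer()
    for transit in transits:
        buffer.observe(transit)
    return buffer.target_ms()


stable = target_for([40, 41, 40, 39, 40])        # low jitter: stays near the minimum
congested = target_for([40, 70, 45, 90, 40, 85])  # high jitter: clamps at the maximum

print(round(stable, 1), congested)  # 21.3 60.0
```

The clamp is the UX decision: the upper bound caps how much latency the buffer may add, even if that means more concealment during congestion.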

Protocol comparison: WebRTC vs. WebSocket

Feature	WebRTC Media (UDP Preferred, Fallbacks Possible)	WebSocket (TCP)
Transport	UDP preferred; relay or TCP/TLS fallback paths possible	TCP
Reliability Model	Media-timed (may use NACK/FEC, but playout stays time-bounded)	Byte-stream reliable (lost bytes block later bytes)
Latency Behavior	Low and stable when tuned	Can spike under loss
Congestion Control	Media-aware adaptation via RTP/RTCP feedback (implementation-dependent)	Managed by the TCP stack, not by media playout deadlines
Encryption	Mandatory: DTLS (Datagram Transport Layer Security) key exchange plus SRTP (Secure Real-time Transport Protocol) media encryption	TLS when served over wss://; plain ws:// is unencrypted
Ideal Use Case	Real-time Voice/Video	Chat, Signaling, File Transfer

WebRTC media is richer than "UDP means drop packets." On UDP media paths it layers RTP (Real-time Transport Protocol) / RTCP (RTP Control Protocol) feedback, jitter buffers, packet loss concealment (PLC), and sometimes NACK (Negative Acknowledgment) or FEC (Forward Error Correction).^{[4]Reference 4Media Transport and Use of RTP in WebRTChttps://www.rfc-editor.org/info/rfc8834/} ICE and TURN can select relay or fallback paths when direct UDP is unavailable (RFC 8835). The design goal is timeliness: recover loss when it can still help, otherwise keep audio moving.

Using WebSockets as a drop-in replacement for WebRTC in browser live audio creates hidden work. You can ship with WebSockets, but you must own chunking, playout buffering, backpressure, and worse behavior under packet loss.

Scaling to thousands of conversations

Designing voice AI for scale introduces challenges around state, media routing, and bursty GPU demand. For a system serving 10,000+ concurrent sessions, the ingress tier usually needs sticky routing or consistent hashing so packets for one conversation keep landing on the same media worker.

Voice scaling diagram showing packets pinned to one media worker while STT, LLM, and TTS pools scale independently behind that worker. — Pin packet state to one media worker. Let STT, LLM, and TTS pools burst behind it.

The media tier scales by active sessions and network state. Inference scales by model pressure, which is why separating these tiers usually beats one giant "voice server" process. The routing sketch uses a stable modulo hash only to show affinity with a fixed worker set. A production router should define what happens when workers join or fail, often with connection-aware routing or consistent hashing to limit remapping.

WebRTC connections carry session state. A specific media worker holds Datagram Transport Layer Security (DTLS) keys, Secure Real-time Transport Protocol (SRTP) state, jitter-buffer state, and often the playback cursor. If a naive round-robin load balancer routes packets to a different worker mid-session, decryption and timing state break immediately.

Behind media workers, inference components should scale independently because their compute profiles differ. Turn detection and some STT stages often fit on CPU. The largest speech models and LLMs usually need GPUs. Independent scaling keeps GPU utilization high without overprovisioning lighter tiers.

keeping-media-sessions-affine.py

import hashlib
import json

workers = ["media-a", "media-b", "media-c"]

def route(session_id: str) -> str:
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return workers[digest % len(workers)]

session_id = "call-alex-4219"
packet_routes = [route(session_id) for _ in range(4)]
failed_worker = packet_routes[0]

print(json.dumps({
    "packet_routes": packet_routes,
    "single_worker_during_session": len(set(packet_routes)) == 1,
    "failed_worker": failed_worker,
    "recovery": "reconnect_and_rebuild_session",
}, indent=2))

Output

{
  "packet_routes": [
    "media-c",
    "media-c",
    "media-c",
    "media-c"
  ],
  "single_worker_during_session": true,
  "failed_worker": "media-c",
  "recovery": "reconnect_and_rebuild_session"
}
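The modulo sketch remaps many sessions when the worker set changes. One way to limit remapping is rendezvous (highest-random-weight) hashing, an alternative to ring-based consistent hashing with the same limited-remapping property. The sketch below is illustrative only; the `score` and `route` helpers and the session names are hypothetical.

```python
import hashlib


def score(worker: str, session_id: str) -> int:
    # Deterministic pseudo-random weight for each (worker, session) pair.
    return int(hashlib.sha256(f"{worker}:{session_id}".encode()).hexdigest(), 16)


def route(session_id: str, workers: list[str]) -> str:
    # Each session goes to the worker with the highest weight for that session.
    return max(workers, key=lambda worker: score(worker, session_id))


workers = ["media-a", "media-b", "media-c"]
sessions = [f"call-{i}" for i in range(1000)]

before = {s: route(s, workers) for s in sessions}
survivors = [w for w in workers if w != "media-b"]  # simulate media-b failing
after = {s: route(s, survivors) for s in sessions}

moved = [s for s in sessions if before[s] != after[s]]
only_failed_worker_moved = all(before[s] == "media-b" for s in moved)

print(len(moved), only_failed_worker_moved)
```

Only sessions that were pinned to the failed worker change route; sessions on surviving workers keep their affinity. Those moved sessions still need the reconnect-and-rebuild path, because DTLS and SRTP state does not travel with the hash.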

Defend the live loop

A strong review explains how first audio arrives fast, how barge-in trims history to heard audio, and why media routing must stay stateful even when model pools scale independently.

Common pitfalls

These are the failure patterns that most often break the voice illusion. In a design review, name the symptom, cause, and fix instead of blaming the model generically.

Symptom: The agent pauses for one or two seconds before it even starts thinking.
Cause: You batch-transcribed the whole utterance instead of using streaming STT partials.
Fix: Emit stable early text so the LLM can pre-read the turn before the final transcript lands.

Symptom: The answer is correct, but the long silence before speech makes the system feel dead.
Cause: You waited for full LLM completion before sending anything to TTS.
Fix: Buffer text into short speakable clauses and flush them as soon as they are stable enough to say aloud.

Symptom: The agent speaks an incorrect incident number or starts an action the caller immediately corrects.
Cause: You treated an unstable STT partial as a committed turn.
Fix: Limit partial-transcript work to reversible preparation; authorize speech and side effects only after turn commitment and required confirmation.

Symptom: Slow speakers get clipped, or every caller feels like they are talking to a laggy robot.
Cause: Endpointing is tuned in the wrong direction for the call type: silence thresholds that are too short clip slow speakers, and thresholds that are too long add lag to every turn.
Fix: Tune silence thresholds and semantic end-of-turn rules against false cutoffs, barge-in latency, and time-to-first-audio.

Symptom: After an interruption, the agent answers as if the user heard words that never played.
Cause: You stored generated text instead of playback-confirmed text.
Fix: Align synthesized chunks to playout timestamps or RTP ranges and keep only the heard prefix in history.

Symptom: Browser audio works in demos but stalls badly when the network drops packets.
Cause: You treated WebSockets as a drop-in replacement for WebRTC media.
Fix: Prefer WebRTC for live browser audio, or explicitly own buffering, backpressure, and loss behavior yourself.

Symptom: The agent sounds clever after several seconds, but conversation feels broken.
Cause: You put deep reasoning or heavy tool latency directly in the live voice loop.
Fix: Keep the live model fast and push expensive reasoning into async tools, follow-up workflows, or background actions.

Follow-up questions

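These questions are part of the lesson. A reader should be able to answer them from the architecture built up above.

How does the system handle packet loss in the audio stream?
Why is session affinity required for the load balancer?
What are the trade-offs of using a native audio model instead of a modular pipeline?
How do you synchronize conversation state when the user interrupts?
When should you choose cascaded STT -> LLM -> TTS over a native speech-to-speech model for a production operations line?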

What strong answers show

Foundational: Explain an example 500ms first-audio objective across endpointing, STT, LLM, TTS, and playout, then name what must be measured before shipping.
Intermediate: Draw the streaming path from mic capture to speaker playout and identify where stable partials allow overlap.
Intermediate: Show how barge-in cancels work, trims history to heard audio, and avoids replaying unheard text into the next turn.
Advanced: Defend WebRTC over WebSockets for browser live audio using deadline-based playout, jitter buffers, and head-of-line blocking.
Advanced: Choose between cascaded and native audio architectures based on observability, tool control, prosody, and failure recovery.
Advanced: Justify session scaling with sticky media routing and independently scaled inference pools.

What to build next

Build something small and measurable to internalize this architecture. These three portfolio projects increase in difficulty:

Voice Mirror: A simple agent that transcribes what you say and speaks it back in a different voice. Start here to wire VAD, STT, and TTS together without the complexity of LLM reasoning.
Incident Hotline Controller: A voice agent that can look up mock incident status and initiate a page via function calls. Add tool use here and measure how latency changes when the LLM must call an API before responding.
Roleplay Operations Agent: A voice agent that uses a clause buffer and records TTFA distributions while giving multi-sentence responses about incident status or escalation policy. This teaches you to tune silence_duration_ms, clause flushing, and barge-in handling against a declared objective.

Voice-agent shipping checks

Building a voice agent with a tight first-audio objective requires measuring every handoff across the processing pipeline. The architecture must be resilient to network jitter while handling the unpredictability of human conversation.

Endpointing, STT, LLM, TTS, tools, and playout all consume or overlap within TTFA; slow work needs acknowledgement or asynchronous handling.
STT streams partials, the LLM streams tokens, and TTS streams audio chunks. Native-audio routes still need measured incremental input, output, and interruption behavior.
Interruption handling is a primary UX differentiator. Systems that don't stop speaking immediately feel robotic and rude.
WebRTC is the live client-media default. Prefer its timed media path and plan for ICE/TURN fallback; WebSockets still fit signaling or managed server-side streaming.
Native audio is a real option. Modular pipelines remain easier to debug and control, while native paths should be chosen only after measuring latency, prosody, policy, and state behavior.
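Reporting TTFA as a distribution rather than a single average can be sketched with the standard library. The samples and the 500ms objective below are hypothetical placeholders; real reports should be computed per network slice from measured calls.

```python
import statistics

# Hypothetical TTFA samples in milliseconds from one network slice.
ttfa_ms = [310, 340, 355, 360, 380, 395, 410, 420, 450, 470, 520, 610, 640, 780, 900]
OBJECTIVE_MS = 500  # hypothetical first-audio objective

# n=100 yields 99 cut points; "inclusive" keeps percentiles within the observed range.
cuts = statistics.quantiles(ttfa_ms, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]

within_objective = sum(sample <= OBJECTIVE_MS for sample in ttfa_ms) / len(ttfa_ms)

print(f"p50={p50:.0f}ms p95={p95:.0f}ms within_objective={within_objective:.0%}")
```

A healthy median can hide a slow tail, so the p95 and the share of turns inside the objective are usually the numbers worth defending in a review.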

After this chapter, you should be able to diagram the lifecycle of a single voice packet from microphone to model and back, calculate a pipeline's cumulative latency and identify its bottleneck, implement basic barge-in logic, and discuss the trade-offs between cascaded and native speech-to-speech models.

Next Step

Continue to Reasoning & Test-Time Compute

Voice agents showed you how strict latency budgets shape a live product. The final capstone studies the opposite decision: when a difficult request is worth extra inference work before the system releases an answer or irreversible effect.


References

[1] Universals and cultural variation in turn-taking in conversation. Stivers, T., et al. · 2009 · PNAS

[2] AudioPaLM: A Large Language Model That Can Speak and Listen. Rubenstein, P. K., et al. · 2023 · arXiv preprint

[3] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Zhang, D., et al. · 2023 · arXiv preprint

[4] Media Transport and Use of RTP in WebRTC. Perkins, C., Westerlund, M., & Ott, J. · 2021 · IETF RFC 8834

[5] Turn Detection for Voice Agents: VAD, Endpointing, and Model-Based Detection. Hall, J. (LiveKit) · 2026

[6] Silero VAD: pre-trained enterprise-grade Voice Activity Detector. Silero Team · 2021

[7] Media Capture and Streams. World Wide Web Consortium · 2026 · W3C Editor's Draft

[8] Whisper: Robust Speech Recognition via Large-Scale Weak Supervision. Radford, A., et al. · 2022 · arXiv preprint

[9] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023

[10] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023

[11] Introducing gpt-realtime and Realtime API updates for production voice agents. OpenAI · 2025

[12] Gemini Live API overview. Google · 2026

[13] NetEq. WebRTC Project · 2026 · Chromium WebRTC documentation

