LeetLLM
© 2026 LeetLLM. All rights reserved.

Real-Time Voice AI Agent

Master real-time voice AI architecture: turn detection, streaming STT/LLM/TTS, native audio trade-offs, WebRTC transport, and barge-in state.


Diffusion Models & Image Generation focused on iterative generation under latency and quality trade-offs. Voice agents bring that pressure into live conversation: audio must stream, tools must resolve, and playback must stay interruptible.

A real-time voice AI agent has to listen, decide, speak, and recover quickly enough that the conversation still feels live. This design chapter covers speech pipelines, turn-taking, latency, tools, and fallback behavior.

Imagine calling customer support to track a delayed package. You say, "Where is my order?" and the voice on the other end pauses for two full seconds before answering. In those two seconds, you already wonder if the call dropped, or if the system heard you at all. That silence kills the illusion that you are talking to something intelligent.

Human conversation runs on split-second timing. Studies of turn-taking show that people minimize silence and overlap across languages, with response timing varying by only hundreds of milliseconds across communities.[1] If a voice AI agent takes much longer than that, the conversation feels broken, no matter how smart the underlying model is. Building a system that listens, thinks, and speaks fast enough to feel natural is one of the hardest engineering problems in modern AI.

This article walks through the architecture of a production-grade voice agent, using a concrete example: a package-tracking assistant for an online store. You will follow a single audio packet from the user's mouth to the model's "brain" and back to the speaker, learn why every millisecond matters, and see how the system recovers when the user interrupts mid-sentence. The concepts here apply to any real-time voice product, from customer support to warehouse coordination.

The latency illusion

The primary difference between a text-based chatbot and a voice agent is the continuous flow of time. A text interface lets the user wait patiently while a spinner animates. In a voice interface, silence is a signal. If the agent takes too long to reply, the user assumes it didn't hear them and will likely repeat themselves, causing confusion and overlapping speech.

A useful product target for an interactive support call is a measured time-to-first-audio (TTFA) budget around 500ms for the network and device slices you intend to support. Treat that as a service objective to validate, not a universal boundary: endpoint policy, tools, model choice, and network conditions move the distribution. Research systems such as AudioPaLM and SpeechGPT[2][3] show where speech-native models can go, while cascaded or hybrid pipelines remain useful when transcript inspection, debugging, and tool control matter.

A live routing line

Before analyzing the architecture, it helps to picture the problem. Building a voice agent is like running a live routing line where each component must hand off partial work without waiting for the full case to finish:

  • Voice Activity Detection (VAD): Opens and closes the intake gate when speech starts and ends.
  • Speech-to-Text (STT): Converts the incoming audio stream into partial text.
  • LLM: Uses stable transcript segments to plan the response.
  • Text-to-Speech (TTS): Starts speaking as soon as enough response text is available.
  • Orchestrator: Coordinates handoffs, interruptions, and latency budgets.

If any stage drops packets or runs too slowly, the conversation breaks. The user experiences that as silence, overlap, or a response to a sentence they already corrected.
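The handoff idea can be sketched with Python generators, where each stage pulls partial work from the one before it. The stage functions, the one-word-per-frame decoding, and the canned replies are hypothetical stand-ins rather than a real STT, LLM, or TTS API. The `trace` list records the order in which work happens.

```python
from typing import Iterator

trace: list[str] = []  # records the order in which stages do work


def stt_stage(audio_frames: Iterator[str]) -> Iterator[str]:
    """Hypothetical STT: pretend each frame decodes to one word."""
    words: list[str] = []
    for frame in audio_frames:
        words.append(frame)
        partial = " ".join(words)
        trace.append(f"stt:{partial}")
        yield partial


def llm_stage(partials: Iterator[str]) -> Iterator[str]:
    """Hypothetical LLM: treat the last partial as committed, then stream clauses."""
    committed = ""
    for partial in partials:
        committed = partial
    for clause in (f"You asked: {committed}.", "Let me check that order."):
        trace.append(f"llm:{clause}")
        yield clause


def tts_stage(clauses: Iterator[str]) -> Iterator[str]:
    """Hypothetical TTS: turn each clause into an audio chunk as it arrives."""
    for clause in clauses:
        trace.append(f"tts:{clause}")
        yield f"<audio:{clause}>"


frames = iter(["where", "is", "my", "order"])
audio_out = tts_stage(llm_stage(stt_stage(frames)))

# Pull only the first audio chunk, as a player would at the start of playback.
first_chunk = next(audio_out)
print(first_chunk)
print(trace)
```

After the first `next` call, the trace contains the STT partials and the first clause, but not the second clause. That is the property the routing line needs: first audio exists before the rest of the answer has been generated.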

One packet's journey: the package-tracking agent

To make this concrete, imagine a customer named Alex who calls a voice support line for an online store. Alex says, "Where is my order number four-two-one-nine?"

Here is what happens to that audio, step by step, before Alex hears a response:

  1. Listen (Voice Activity Detection): The system detects when Alex starts speaking and, most importantly, when Alex stops. It keeps background noise out of the speech stream.
  2. Transcribe (Speech-to-Text): The audio is converted to text. This must happen while Alex is still speaking (streaming), not just after Alex finishes.
  3. Think (Large Language Model): The model may prepare from stable partial transcripts, but the orchestrator must not speak a factual answer or execute a side effect until turn-commit policy accepts the input. Once committed, it streams the answer in speakable chunks.
  4. Speak (Text-to-Speech): The voice synthesis engine reads the first few words as soon as they arrive from the LLM, without waiting for the full sentence.

Each of these steps gets its own detailed look below. First, let's set the latency budget so you know what "fast enough" means in practice.

The latency budget: a worked example

Designing a voice AI agent requires balancing functional capabilities with strict latency constraints. The system needs to understand and generate natural language fast enough to mimic human turn-taking.

Functional requirements

  • Real-time bidirectional conversation: Near-full-duplex experience where the user can interrupt at any time.
  • Natural language understanding: Processing speech input into actionable intent.
  • LLM reasoning: Generating context-aware responses and executing tools (order lookup, delivery estimation).
  • Natural speech synthesis: Generating audio with human-like prosody and emotion.
  • Interruption handling (barge-in): The agent must stop speaking immediately when the user interrupts.

Example service objectives

  • Time to first audio: Track p50 and p95 from committed turn end to first played audio; a support line might start with a p95 target of 500ms on declared network slices.
  • Mute on barge-in: Track user-speech onset to muted playback; a product might set a p95 target of 100ms.
  • Concurrency: Size media and inference tiers from forecast active sessions and measured per-session work, rather than assuming 10k+.
  • Audio path: Negotiate supported codecs and sample rates, then measure jitter, loss, and intelligibility by device class.
  • Availability: Choose an error budget for the call product and include media-connect, model, and tool failure modes.
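A minimal sketch of how a TTFA objective might be checked against measurements follows. The sample values are made up for illustration, and `percentile` uses the nearest-rank definition, which is one of several common conventions.

```python
def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


# Hypothetical time-to-first-audio samples in milliseconds for one network slice.
ttfa_ms = [
    310, 342, 355, 361, 378, 390, 402, 415, 460, 488,
    495, 512, 530, 610, 720, 365, 372, 380, 398, 405,
]
target_p95_ms = 500

report = {
    "p50_ms": percentile(ttfa_ms, 50),
    "p95_ms": percentile(ttfa_ms, 95),
    "target_p95_ms": target_p95_ms,
}
report["meets_objective"] = report["p95_ms"] <= target_p95_ms
print(report)
```

In this made-up sample the median looks comfortable while the tail misses the target, which is why the objective is stated on p95 rather than on an average.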

Latency budget breakdown

Use 500ms as an example target, not a law of physics. A deployed route must report TTFA distributions rather than promise one illustrative number.

[Figure: Voice latency budget showing turn detection, first STT partial, first LLM clause, first TTS chunk, and playout buffer ranges.]
A 500ms sequential budget only works when stages overlap and first audio starts before full answer completion.

Trace the ranges: each stage adds time, but streaming overlap allows read-only preparation before the final transcript and lets TTS start once an accepted turn has yielded a stable speakable clause.

Treat the 500ms number as a sequential accounting exercise. Good implementations hide some of that with streaming STT, streaming LLM output, and early TTS.

Let's walk through Alex's "Where is my order?" query with real numbers:

| Stage | Time | What's happening |
| --- | --- | --- |
| Turn detection | 40-80ms | VAD decides Alex has stopped speaking |
| First STT partial | 50-150ms | Streaming decoder emits "Where is my order" while Alex is still talking |
| First LLM clause | 100-220ms | Model generates "Your order four-two-one-nine is" based on partial transcript |
| First TTS chunk | 80-120ms | Synthesizer turns that clause into the first 200ms of audio |
| Playout buffer | 20-40ms | Jitter buffer smooths network before the speaker |

If these ran one after another, the ranges would add up to 290-610ms, so about 500ms is a reasonable planning figure. But in practice:

  • STT starts decoding while Alex is still finishing the word "order."
  • The orchestrator can pre-classify intent or start a cancellable read-only lookup from stable words; it cannot commit an order number, speak a result, or perform a side effect from an unstable transcript.
  • After turn commitment, TTS starts synthesizing as soon as the LLM produces a stable clause like "I found that order."

In one candidate trace, this overlap may deliver first audio in 300-450ms. The production decision depends on measured p50/p95 TTFA, incorrect-commit rate, endpointing aggressiveness, model size, and network jitter.

accounting-for-overlapped-first-audio.py

```python
import json

# One measured trace, in milliseconds after committed turn end.
# Each span is (start_ms, end_ms); spans may overlap.
spans = {
    "endpoint_commit": (0, 60),
    "stt_finalization": (0, 100),
    "llm_first_clause": (60, 240),
    "tts_first_chunk": (210, 345),
    "playout_buffer": (345, 380),
}
target_ms = 500
# First audio is heard when the playout buffer releases its first chunk.
first_audio_ms = spans["playout_buffer"][1]

print(json.dumps({
    "sequential_sum_ms": sum(end - start for start, end in spans.values()),
    "first_audio_ms": first_audio_ms,
    "target_ms": target_ms,
    "meets_trace_target": first_audio_ms <= target_ms,
    "ship_decision_requires_p95": True,
}, indent=2))
```

Output

```json
{
  "sequential_sum_ms": 510,
  "first_audio_ms": 380,
  "target_ms": 500,
  "meets_trace_target": true,
  "ship_decision_requires_p95": true
}
```

How the pieces fit together

The architecture usually separates three concerns: client-side capture and playback, a latency-sensitive media edge, and the inference pipeline. Decoupling transport from inference matters because audio delivery has different failure modes from model execution. Packet loss, jitter, and echo cancellation must be handled on the media path even if the model stack is healthy.

In many deployments, a media gateway terminates WebRTC (Web Real-Time Communication) or SIP (Session Initiation Protocol), handles jitter buffers and routing, then forwards audio to inference over an internal low-latency link.[4] Not every system needs a full SFU (Selective Forwarding Unit) for a 1:1 assistant, but most serious deployments still benefit from a dedicated ingress tier close to the user.
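To make the jitter-buffer role concrete, here is a minimal reorder-buffer sketch keyed on packet sequence numbers. It is an illustration under simplifying assumptions: real jitter buffers are time-based and adaptive, and the `JitterBuffer` class, its `max_waiting` parameter, and the string payloads are hypothetical.

```python
import heapq


class JitterBuffer:
    """Release packets in sequence order; skip a gap once too many packets are waiting."""

    def __init__(self, max_waiting: int = 2):
        self.max_waiting = max_waiting
        self.heap: list[tuple[int, str]] = []
        self.next_seq = 0  # next sequence number the player expects

    def push(self, seq: int, payload: str) -> list[str]:
        """Add one packet and return any payloads that are now ready to play."""
        if seq < self.next_seq:
            return []  # arrived after playout moved past it
        heapq.heappush(self.heap, (seq, payload))
        ready: list[str] = []
        while self.heap:
            head_seq, head_payload = self.heap[0]
            if head_seq < self.next_seq:
                heapq.heappop(self.heap)  # duplicate of a packet already played
            elif head_seq == self.next_seq:
                heapq.heappop(self.heap)
                ready.append(head_payload)
                self.next_seq += 1
            elif len(self.heap) > self.max_waiting:
                self.next_seq = head_seq  # give up on the missing packet
            else:
                break  # wait a little longer for the gap to fill
        return ready


# Packet 1 arrives late but in time; packet 4 arrives after it was skipped.
arrivals = [(0, "a"), (2, "c"), (1, "b"), (3, "d"), (5, "f"), (6, "g"), (7, "h"), (4, "e")]
buffer = JitterBuffer()
played: list[str] = []
for seq, payload in arrivals:
    played.extend(buffer.push(seq, payload))
print(played)
```

The trade-off shows up in `max_waiting`: a deeper buffer recovers more reordered packets but adds delay to the playout slice of the latency budget.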

Within inference, the voice pipeline orchestrates turn detection, STT or native-audio encoding, LLM reasoning, tool calls, and TTS or audio decoding. A conversation state manager sits alongside the pipeline, keeping track of dialogue turns, playback offsets, and tool results so the model only commits to what the user actually heard.

[Figure: Real-time voice agent architecture with client capture, media edge, inference orchestration, state tracking, and barge-in path.]
Media transport, inference orchestration, state tracking, and barge-in cancellation are separate control loops.

Follow the flow from left to right: capture, media edge, inference, and back to the speaker. Notice the barge-in path that instantly stops playback when the user interrupts.

Knowing when to listen: voice activity detection

Voice Activity Detection (VAD) determines when the user starts and stops speaking. In Alex's package-tracking call, VAD is what tells the system, "Alex is done asking the question. You can start answering now."

VAD is not the same thing as turn detection. VAD classifies each audio frame as speech or silence. Turn detection (endpointing) decides when the user's turn is actually finished so the agent can answer. Modern stacks increasingly layer three signals: VAD for raw speech presence, transcript-level endpointing for silence after stable words, and a model that judges semantic completeness from the partial transcript. Model-based end-of-turn detectors can commit a turn before trailing silence accumulates, cutting both clipped speakers and laggy responses.[5] You still keep a fast VAD underneath for instant barge-in.

Client-side VAD is common because it gives instant barge-in detection and avoids sending obvious silence. Server-side turn detection is also common because it centralizes tuning and can use richer context. The implementation below shows a generic local wrapper around a streaming VAD backend such as Silero VAD[6].

knowing-when-to-listen-voice-activity.py

```python
from collections import deque
from dataclasses import asdict, dataclass
import json


@dataclass
class VADEvent:
    type: str
    frame: int
    audio_ms: int


class VoiceActivityDetector:
    """Detect speech boundaries from one probability per 20ms audio frame."""

    def __init__(
        self,
        frame_ms: int = 20,
        speech_threshold: float = 0.5,
        silence_duration_ms: int = 60,
        prefix_padding_ms: int = 40,
    ):
        self.frame_ms = frame_ms
        self.speech_threshold = speech_threshold
        self.silence_frames_required = max(1, silence_duration_ms // frame_ms)
        # Pre-roll keeps the last few frames so the utterance start is not clipped.
        self.pre_roll = deque(maxlen=max(1, prefix_padding_ms // frame_ms))
        self.speech_frames: list[int] = []
        self.in_speech = False
        self.trailing_silence = 0

    def process_frame(self, frame_id: int, probability: float) -> VADEvent | None:
        """Return speech boundary events for one audio frame."""
        self.pre_roll.append(frame_id)
        if probability >= self.speech_threshold:
            if not self.in_speech:
                self.in_speech = True
                self.speech_frames = list(self.pre_roll)
                self.trailing_silence = 0
                return VADEvent("speech_start", frame_id, len(self.speech_frames) * self.frame_ms)
            self.speech_frames.append(frame_id)
            self.trailing_silence = 0
            return None

        if not self.in_speech:
            return None

        # Silence inside an utterance: count it, but wait before closing the turn.
        self.speech_frames.append(frame_id)
        self.trailing_silence += 1
        if self.trailing_silence < self.silence_frames_required:
            return None

        # Enough trailing silence: drop the silent tail and emit speech_end.
        utterance_frames = self.speech_frames[:-self.trailing_silence]
        self.in_speech = False
        self.speech_frames = []
        self.trailing_silence = 0
        self.pre_roll.clear()
        return VADEvent("speech_end", frame_id, len(utterance_frames) * self.frame_ms)


probabilities = [0.04, 0.08, 0.66, 0.74, 0.69, 0.22, 0.15, 0.09]
detector = VoiceActivityDetector()
events = []

for frame_id, probability in enumerate(probabilities):
    event = detector.process_frame(frame_id, probability)
    if event:
        events.append(asdict(event))

print(json.dumps(events, indent=2))
```

Output

```json
[
  {
    "type": "speech_start",
    "frame": 2,
    "audio_ms": 40
  },
  {
    "type": "speech_end",
    "frame": 7,
    "audio_ms": 80
  }
]
```

The demo above sets silence_duration_ms = 60 so that an eight-frame sample is long enough to close the utterance. For real calls, silence_duration_ms = 300 is a good starting point, not a universal truth. Phone support bots often go longer to avoid clipping slow speakers. Push-to-talk or assistant experiences can go shorter. A fixed silence threshold is the floor, not the ceiling: a semantic end-of-turn model can hold through mid-sentence pauses (a caller reciting a phone number digit by digit) and fire early once the utterance is clearly complete, so you do not have to pick one silence value that hurts every other case.[5]
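One way to picture a semantic end-of-turn signal is as a score that stretches or shrinks the required trailing silence. The sketch below assumes a hypothetical `completeness` score in [0, 1] from some end-of-turn model. The linear mapping and the 100/300/900ms anchor values are illustrative assumptions, not values taken from any particular product.

```python
def required_silence_ms(
    completeness: float,
    min_ms: int = 100,
    base_ms: int = 300,
    max_ms: int = 900,
) -> int:
    """Map a hypothetical end-of-turn score to the trailing silence needed to commit."""
    completeness = min(1.0, max(0.0, completeness))
    if completeness >= 0.5:
        # Likely complete: shrink linearly from base_ms toward min_ms.
        return round(base_ms - (base_ms - min_ms) * (completeness - 0.5) / 0.5)
    # Likely unfinished: stretch linearly from base_ms toward max_ms.
    return round(base_ms + (max_ms - base_ms) * (0.5 - completeness) / 0.5)


def should_commit(trailing_silence_ms: int, completeness: float) -> bool:
    return trailing_silence_ms >= required_silence_ms(completeness)


# A caller pausing while reciting digits: long pause, low completeness score.
mid_number = should_commit(trailing_silence_ms=400, completeness=0.1)
# A clearly finished question: short pause, high completeness score.
finished_question = should_commit(trailing_silence_ms=160, completeness=0.9)

print({"mid_number_commits": mid_number, "finished_question_commits": finished_question})
```

A fixed 300ms rule would get both of these cases wrong: it would cut off the caller after the 400ms pause and keep waiting after the finished question.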

VAD deployment: client vs. server

Choosing where to run VAD changes responsiveness, bandwidth, and operational control. Client-side detection wins on local responsiveness. Server-side detection wins on centralized tuning and consistent turn boundaries across devices.

| Feature | Client-Side VAD | Server-Side Detection |
| --- | --- | --- |
| Barge-in reaction | Can mute local playout without a network round trip | Includes transport delay before server action reaches playback |
| Bandwidth | Can suppress silence if upload policy uses its output | Receives the configured upstream stream continuously |
| Operational control | Requires testing across device/browser classes | Centralizes threshold and model tuning |
| Turn consistency | Device signals may differ | Can apply one server-side commit policy |
| Data exposure | Reduced only if capture/upload policy withholds silent frames | Depends on retained audio and server policy |

Many production systems use both: lightweight local VAD to mute playback instantly on barge-in, plus server-side turn detection to decide when transcripts are stable enough to trigger response generation.

[Figure: Hybrid VAD topology comparing local barge-in detection, server endpointing, and hybrid policy.]
Local VAD gives fast interruption; server endpointing keeps transcript and analytics boundaries consistent.

The hybrid setup gives the user the fast interruption path they expect while keeping final turn boundaries consistent enough for transcripts, tools, and analytics.
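The division of responsibility in a hybrid setup can be sketched as a small policy object. The event names (`speech_start`, `turn_commit`) and action strings are hypothetical. The point is only that the local signal controls playback while the server signal alone commits turns.

```python
from dataclasses import dataclass, field


@dataclass
class HybridTurnPolicy:
    agent_speaking: bool = False
    actions: list[str] = field(default_factory=list)

    def on_local_vad(self, event: str) -> None:
        """Client-side signal: used only for fast playback control."""
        if event == "speech_start" and self.agent_speaking:
            self.agent_speaking = False
            self.actions += ["mute_local_playback", "notify_server_barge_in"]
        # Local speech_end is ignored here: it never commits a turn.

    def on_server_endpoint(self, event: str, transcript: str = "") -> None:
        """Server-side signal: the only source of committed turn boundaries."""
        if event == "turn_commit":
            self.actions.append(f"generate_response:{transcript}")


policy = HybridTurnPolicy(agent_speaking=True)
policy.on_local_vad("speech_start")   # user talks over the agent
policy.on_local_vad("speech_end")     # local silence alone does nothing
policy.on_server_endpoint("turn_commit", "actually just cancel it")
print(policy.actions)
```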

Acoustic echo cancellation (AEC)

A critical but often overlooked component is AEC, which prevents the agent from hearing itself. When the agent speaks and the user interrupts, the microphone picks up both the user's voice and the agent's own output playing from the speaker. Without AEC, the VAD would trigger on the agent's voice, creating an echo loop where the agent transcribes its own speech and responds to it.

Modern browser capture stacks expose echo-cancellation constraints so applications can request microphone input with system or remote audio removed before it reaches VAD and STT.[7] Under the hood, echo cancellers typically use adaptive filters that model the acoustic path from speaker to microphone. The filter continuously adjusts to account for speaker placement, room reverberation, and device variations.

One common failure is deploying voice agents without testing AEC in real acoustic environments. Speakers and microphones placed close together, as in laptops and mobile phones, create strong echo that untested AEC configurations may fail to suppress. The symptom is brutal: the agent interrupts itself.
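The adaptive-filter idea can be illustrated with a normalized least-mean-squares (NLMS) filter, one of the simplest members of that family. This is a toy sketch, not how any specific browser or device implements AEC. The synthetic "room" (a two-tap delayed echo), the filter length, and the step size are all assumptions, and the example has no near-end speech, so it sidesteps the double-talk problem that production cancellers must handle.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptively estimated echo of far_end from mic, sample by sample."""
    weights = [0.0] * taps
    history = [0.0] * taps  # recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what is left after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def energy(samples):
    return sum(s * s for s in samples)


random.seed(7)
n = 3000
far_end = [random.uniform(-1, 1) for _ in range(n)]  # stand-in for agent audio
# Hypothetical room: the mic hears a delayed, attenuated copy of the agent audio.
mic = [
    0.6 * (far_end[i - 2] if i >= 2 else 0.0) + 0.3 * (far_end[i - 3] if i >= 3 else 0.0)
    for i in range(n)
]

residual = nlms_echo_cancel(far_end, mic)
tail = slice(n - 500, n)  # measure after the filter has had time to adapt
reduction = energy(residual[tail]) / energy(mic[tail])
print(f"residual echo energy is {reduction:.2e} of mic energy")
```

Once the filter has adapted, the residual fed to VAD carries only a small fraction of the echo energy. Testing in real rooms matters because the true echo path is longer, noisier, and changes as the device moves.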

From sound to text: speech-to-text

Two primary approaches exist for real-time STT. In Alex's call, streaming STT is what lets the system start thinking about "Where is my order" before Alex fully finishes saying "order."

Streaming STT (recommended for latency)

Streaming STT sends partial transcripts as audio arrives, allowing the LLM to pre-read the turn and start cancellable, read-only preparation before the utterance is complete. Provider SDKs differ, but the state machine must separate unstable interim text from committed text that may enter history, trigger tools, or reach TTS.

streaming-stt-recommended-for-latency.py

```python
from dataclasses import asdict, dataclass
import json


@dataclass
class TranscriptEvent:
    kind: str
    text: str
    stable_prefix: str


def emit_transcript_events(partials: list[tuple[str, bool]]) -> list[TranscriptEvent]:
    events = []

    for text, is_final in partials:
        words = text.split()
        # Treat the last word of an interim result as unstable; finals are fully stable.
        stable_prefix = " ".join(words[:-1]) if not is_final else text

        if is_final:
            events.append(TranscriptEvent("final", text, stable_prefix))
        else:
            events.append(TranscriptEvent("interim", text, stable_prefix))

    return events


partials = [
    ("Where is", False),
    ("Where is my order", False),
    ("Where is my order number four two", False),
    ("Where is my order number four two one nine?", True),
]

print(json.dumps([asdict(event) for event in emit_transcript_events(partials)], indent=2))
```

Output

```json
[
  {
    "kind": "interim",
    "text": "Where is",
    "stable_prefix": "Where"
  },
  {
    "kind": "interim",
    "text": "Where is my order",
    "stable_prefix": "Where is my"
  },
  {
    "kind": "interim",
    "text": "Where is my order number four two",
    "stable_prefix": "Where is my order number four"
  },
  {
    "kind": "final",
    "text": "Where is my order number four two one nine?",
    "stable_prefix": "Where is my order number four two one nine?"
  }
]
```

The commit boundary matters most when a partial transcript changes meaning. A package lookup may be safe to prefetch and discard; canceling an order or promising an ETA is not.

gating-work-on-transcript-commit.py

```python
import json


def permitted_actions(text: str, committed: bool) -> list[str]:
    # Reversible work is allowed on any transcript, stable or not.
    actions = ["classify_intent"]
    if "order" in text.lower():
        actions.append("prefetch_read_only_status")
    # History writes, speech, and confirmations wait for a committed turn.
    if committed:
        actions.extend(["write_history", "speak_response"])
        if "cancel" in text.lower():
            actions.append("ask_for_cancel_confirmation")
    return actions


events = [
    {"text": "Cancel order four two one nine", "committed": False},
    {"text": "Track order four two one nine", "committed": True},
]

print(json.dumps([
    {"committed": event["committed"], "actions": permitted_actions(**event)}
    for event in events
], indent=2))
```

Output

```json
[
  {
    "committed": false,
    "actions": [
      "classify_intent",
      "prefetch_read_only_status"
    ]
  },
  {
    "committed": true,
    "actions": [
      "classify_intent",
      "prefetch_read_only_status",
      "write_history",
      "speak_response"
    ]
  }
]
```

End-of-utterance STT (simpler, slightly higher latency)

Using a batch recognizer such as Whisper[8] after speech_end is much simpler, but it adds an extra decode step after the user stops talking.

end-of-utterance-stt-simpler-slightly-higher.py

```python
import json


def batch_stt_timeline(speech_ms: int, endpoint_ms: int, decode_ms: int) -> dict[str, int | str]:
    # The recognizer cannot start until the endpoint fires, so decode time adds on top.
    speech_end_ms = speech_ms + endpoint_ms
    transcript_ready_ms = speech_end_ms + decode_ms
    return {
        "approach": "end-of-utterance",
        "speech_ms": speech_ms,
        "endpoint_ms": endpoint_ms,
        "decode_ms": decode_ms,
        "first_transcript_ms": transcript_ready_ms,
    }


print(json.dumps(batch_stt_timeline(speech_ms=1200, endpoint_ms=300, decode_ms=450), indent=2))
```

Output

```json
{
  "approach": "end-of-utterance",
  "speech_ms": 1200,
  "endpoint_ms": 300,
  "decode_ms": 450,
  "first_transcript_ms": 1950
}
```

| Approach | Latency | Complexity |
| --- | --- | --- |
| Streaming STT | First partial often arrives in ~50-150ms | High |
| End-of-utterance | Adds ~200-800ms+ after speech_end | Low |

Thinking before the user finishes: LLM reasoning with streaming

After turn commitment, the LLM should emit a stable first speakable clause without waiting for a long response plan. Before commitment, the system may only do reversible preparation. This makes Time-to-First-Token (TTFT), the duration it takes for the LLM to produce its first output token, relevant without allowing an unstable transcript to become spoken truth; the serving mechanics are covered in Inference Mechanics.

Long inference-time reasoning and slow tools are poor candidates for a low-latency acknowledgement path unless measurements meet its TTFA objective. A voice agent can speak an honest acknowledgement quickly, run complex work asynchronously, and deliver the result when ready instead of pretending every question has an instant answer.
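The acknowledge-then-deliver pattern might look like the following asyncio sketch. The `lookup_order` coroutine, its 50ms sleep, and the 10ms acknowledgement budget are hypothetical placeholders for a real tool call and a measured budget.

```python
import asyncio


async def lookup_order(order_id: str) -> str:
    """Hypothetical slow tool; the sleep stands in for a backend call."""
    await asyncio.sleep(0.05)
    return f"Order {order_id} arrives tomorrow."


async def respond(order_id: str, ack_budget_s: float = 0.01) -> list[str]:
    """Speak an acknowledgement only if the tool misses the budget, then the result."""
    spoken: list[str] = []
    task = asyncio.create_task(lookup_order(order_id))
    # asyncio.wait returns at the timeout without cancelling the task.
    done, _pending = await asyncio.wait({task}, timeout=ack_budget_s)
    if not done:
        spoken.append("One moment while I look that up.")
    spoken.append(await task)
    return spoken


slow_path = asyncio.run(respond("4219"))
fast_path = asyncio.run(respond("4219", ack_budget_s=1.0))
print(slow_path)
print(fast_path)
```

When the tool answers inside the budget, the caller hears only the result. When it does not, the acknowledgement fills the silence while the lookup keeps running.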

The helper below shows the voice-specific shape: short clauses, no Markdown, tool status stated early, and a stream of text chunks that TTS can consume before the full response is complete.

thinking-before-the-user-finishes-llm.py

```python
import json


def voice_chunks(transcript: str, order_status: dict[str, str]) -> list[str]:
    # Front-load the answer, then split into short clauses that TTS can speak one by one.
    reply = (
        f"Your order {order_status['id']} is delayed. "
        f"It should arrive {order_status['eta']}. "
        "I can text you the tracking link."
    )
    clauses = [part.strip() + "." for part in reply.split(".") if part.strip()]
    return clauses


transcript = "Where is my order number four two one nine?"
chunks = voice_chunks(
    transcript=transcript,
    order_status={"id": "4219", "eta": "tomorrow"},
)

print(json.dumps({
    "contains_order": "order" in transcript.lower(),
    "max_words_per_clause": max(len(clause.split()) for clause in chunks),
    "streamed_clauses": chunks,
}, indent=2))
```

Output

```json
{
  "contains_order": true,
  "max_words_per_clause": 7,
  "streamed_clauses": [
    "Your order 4219 is delayed.",
    "It should arrive tomorrow.",
    "I can text you the tracking link."
  ]
}
```

Prompt engineering for voice is distinct from text. Voice prompts must explicitly forbid Markdown, which TTS reads poorly, and encourage front-loading the answer so meaningful audio is generated immediately.
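As an illustration, a voice system prompt can be paired with a defensive cleanup pass for the case where the model ignores the instruction. Both the prompt wording and the `strip_markdown_for_tts` helper are assumptions for this sketch. The regexes cover only common Markdown (links, headings, bullets, emphasis, inline code) and are not a complete parser.

```python
import re

# Hypothetical system prompt; the wording is illustrative only.
VOICE_SYSTEM_PROMPT = (
    "You are a phone support agent. Answer in short spoken sentences. "
    "Lead with the answer. Do not use Markdown, lists, headings, or emoji."
)


def strip_markdown_for_tts(text: str) -> str:
    """Remove common Markdown so the synthesizer does not read symbols aloud."""
    # [label](url) becomes label
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Drop heading markers, bullets, and numbered-list prefixes at line starts.
    text = re.sub(r"^\s{0,3}(#{1,6}|[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    # Drop emphasis and inline-code markers.
    text = re.sub(r"(\*\*|__|\*|`)", "", text)
    # Collapse newlines and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = "**Your order** is `delayed`.\n- It arrives [tomorrow](https://example.com/t)."
clean = strip_markdown_for_tts(raw)
print(clean)
```

The prompt remains the primary control. The cleanup is a backstop placed between the LLM stream and TTS.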

Serving optimizations that matter in voice

Live voice sessions stress inference differently from chat. TTFT matters more than long-form throughput because TTS can't speak until the first stable clause arrives. Session length also matters because a 20-minute call can accumulate enough context to make KV-cache management part of the latency budget, not just an infrastructure detail.

  • Paged KV caches: Long-lived sessions fragment memory if every request expects one contiguous KV region. PagedAttention-style allocators let serving systems grow and trim per-session context without constant copying, which helps keep latency stable across many simultaneous calls.[9]
  • Speculative decoding: A draft model proposes tokens that the target model verifies in parallel, reducing decode latency when acceptance and implementation overhead are favorable.[10] It is not an automatic TTFT improvement; benchmark TTFA and total clause latency on the served model pair.
  • Aggressive context hygiene: Don't keep every interim transcript forever. Preserve final transcripts, tool results, and only the portion of assistant audio the user actually heard before barge-in.
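The paged-allocation idea from the first bullet can be shown with a toy block table. This is a bookkeeping sketch only: the `PagedKVAllocator` class, the block size, and the session names are hypothetical, and a real serving system stores tensors on the GPU rather than integers in a list.

```python
class PagedKVAllocator:
    """Toy block table: a session owns a list of fixed-size blocks, not one contiguous region."""

    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))
        self.block_table: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_tokens(self, session: str, n_tokens: int) -> None:
        """Grow a session's context, claiming new blocks only when the last one is full."""
        total = self.token_counts.get(session, 0) + n_tokens
        needed = -(-total // self.block_tokens)  # ceiling division
        table = self.block_table.setdefault(session, [])
        extra = needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("no free KV blocks; trim or evict a session")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.token_counts[session] = total

    def release(self, session: str) -> None:
        """Return all of a finished call's blocks to the free pool."""
        self.free_blocks.extend(self.block_table.pop(session, []))
        self.token_counts.pop(session, None)


allocator = PagedKVAllocator(num_blocks=8)
allocator.append_tokens("call_a", 40)  # 3 blocks
allocator.append_tokens("call_b", 20)  # 2 blocks
allocator.append_tokens("call_a", 10)  # 50 tokens total, so 1 more block
call_a_blocks = list(allocator.block_table["call_a"])
allocator.release("call_b")
print(call_a_blocks, len(allocator.free_blocks))
```

The blocks owned by `call_a` are not adjacent, and growing the call did not require copying its earlier context. That is the property that keeps long sessions from fragmenting memory.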

Speaking without waiting for the full answer: TTS with streaming

The key optimization is to begin speaking before the full response is generated. The TTS engine receives a stream of text tokens, buffers them into sentences or phrases, and synthesizes audio chunks on the fly.

Waiting for the LLM to finish an entire sentence or paragraph before starting TTS can add seconds of latency. Stream tokens and synthesize clause-by-clause instead.

The function below demonstrates buffering text into short speakable clauses. Waiting for a full paragraph is too slow, but firing on every token sounds choppy.

speaking-without-waiting-for-the-full-answer.py

```python
import json


def flush_speakable_clauses(tokens: list[str], max_chars: int = 52) -> list[str]:
    buffer = ""
    flushed = []

    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        # Flush at clause punctuation, or when the buffer grows too long to hold back.
        if stripped.endswith((",", ".", "!", "?")) or len(stripped) >= max_chars:
            flushed.append(stripped)
            buffer = ""

    # Flush whatever remains when the token stream ends.
    if buffer.strip():
        flushed.append(buffer.strip())

    return flushed


tokens = [
    "Your ", "order ", "4219 ", "is ", "delayed, ",
    "but ", "it ", "should ", "arrive ", "tomorrow. ",
    "I ", "can ", "text ", "the ", "tracking ", "link."
]

print(json.dumps(flush_speakable_clauses(tokens), indent=2))
```

Output

```json
[
  "Your order 4219 is delayed,",
  "but it should arrive tomorrow.",
  "I can text the tracking link."
]
```

How parallel execution reduces latency

The 500ms budget represents the theoretical sequential sum of all components, but the production pipeline overlaps stages to fit well under that target. Here is a realistic overlap:

  • VAD runs continuously, monitoring audio as it arrives. It signals the endpoint the moment trailing silence is detected, so the already-running STT stream can finalize.
  • STT runs in parallel with the user still finishing their last words. The first partial transcript arrives while speech is ongoing.
  • Reversible LLM preparation can start from stable STT partials, while spoken or side-effecting work waits for turn commitment.
  • TTS begins synthesizing as soon as the committed response has a stable speakable clause, which may be well before the LLM finishes its full response.
  • Audio playback begins as soon as the first TTS chunk is ready.

The illustrated trace lands around 300-450ms time to first audio (TTFA) on its assumed healthy path. Real TTFA depends on endpointing errors, model size, tool routing, and jitter-buffer depth; report its distribution for target network slices.
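
The gap between the sequential sum and the overlapped path is simple arithmetic. The sketch below uses made-up per-stage numbers that add up to the 500ms sequential budget, then subtracts the portion of each stage assumed to be hidden by overlap. Every value is an illustrative assumption, not a measurement of any real system.

```python
# Illustrative per-stage latencies in milliseconds (assumed, not measured).
sequential_ms = {
    "endpointing": 150,
    "stt_final": 100,
    "llm_first_clause": 150,
    "tts_first_chunk": 70,
    "playout_buffer": 30,
}

# Portion of each stage assumed to be hidden by overlap with earlier work.
hidden_by_overlap_ms = {
    "stt_final": 70,         # streaming STT did most of its work during speech
    "llm_first_clause": 50,  # prompt was prepared from stable partials
}


def time_to_first_audio(stages: dict[str, int], hidden: dict[str, int]) -> int:
    """Sum only the part of each stage that remains on the critical path."""
    return sum(ms - hidden.get(name, 0) for name, ms in stages.items())


sequential_total = sum(sequential_ms.values())
overlapped_total = time_to_first_audio(sequential_ms, hidden_by_overlap_ms)

print(f"sequential sum: {sequential_total} ms")  # 500 ms
print(f"overlapped TTFA: {overlapped_total} ms")  # 380 ms
```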

[Figure: Voice pipeline overlap chart with endpointing, STT, LLM, and speak lanes.]
First audio starts before the full response exists.

The key idea is overlap, not magic speed. VAD, STT, LLM, TTS, and playout each add latency, but the first audible word can arrive before every downstream job has fully completed.

When the user interrupts: barge-in handling

When the user speaks over the agent ("barge-in"), the system must react instantly to maintain the illusion of a natural conversation. Imagine Alex asks about an order, the agent starts explaining the return policy, and Alex interrupts with "Actually, just cancel it." The agent must stop immediately.

  1. Stop Playback: Immediately halt the audio output (less than 100ms).
  2. Cancel Generation: Abort in-flight TTS and LLM requests to save cost and compute.
  3. State Reconciliation: Update the conversation context to reflect only what the user actually heard.

[Figure: Barge-in timeline showing heard audio kept in history and unheard generated audio discarded at the interruption boundary.]
Barge-in state must reflect what the user actually heard, not what the model generated.

The hard part isn't stopping the speaker. The hard part is making the next model call remember only the part of the assistant response that crossed the speaker boundary.
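
Steps 1 and 2 above map naturally onto task cancellation. The sketch below is a minimal asyncio simulation, not a real audio stack: playout_loop is a hypothetical stand-in for TTS plus speaker output, and a queue of ticks plays the role of the audio clock so the example stays deterministic.

```python
import asyncio


async def playout_loop(clauses: list[str], ticks: asyncio.Queue, played: list[str]) -> None:
    """Hypothetical stand-in for TTS + speaker output: one clause per clock tick."""
    for clause in clauses:
        await ticks.get()       # wait for the simulated audio clock
        played.append(clause)   # this clause has crossed the speaker boundary
        ticks.task_done()


async def simulate_barge_in(clauses: list[str], ticks_before_interrupt: int) -> dict[str, object]:
    ticks: asyncio.Queue = asyncio.Queue()
    played: list[str] = []
    task = asyncio.create_task(playout_loop(clauses, ticks, played))

    for _ in range(ticks_before_interrupt):
        await ticks.put(None)
        await ticks.join()      # block until that clause has been "played"

    task.cancel()               # user spoke: stop playout and abort pending work
    try:
        await task
    except asyncio.CancelledError:
        pass
    return {"played": played, "cancelled": task.cancelled()}


clauses = ["Your order is delayed,", "because the carrier", "missed pickup."]
result = asyncio.run(simulate_barge_in(clauses, ticks_before_interrupt=2))
print(result)
# {'played': ['Your order is delayed,', 'because the carrier'], 'cancelled': True}
```

The played list is what feeds state reconciliation: only those clauses belong in the next model call.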

The handler below shows the core reconciliation logic. When VAD detects fresh user speech during agent playback, the system must stop playout, cancel in-flight generation, and truncate assistant text to the portion that actually reached the speaker.

when-the-user-interrupts-barge-in-handling.py
from dataclasses import dataclass
import json

@dataclass
class PlayedChunk:
    text: str
    text_end_char: int      # offset in the full response where this chunk ends
    playback_end_ms: int    # time at which this chunk finished playing

class InterruptionHandler:
    def __init__(self, pending_text: str, played_chunks: list[PlayedChunk]):
        self.pending_text = pending_text
        self.played_chunks = played_chunks

    def reconcile(self, interrupt_ms: int) -> dict[str, object]:
        # Split the generated response at the last fully played chunk.
        heard_chars = self._chars_heard(interrupt_ms)
        spoken_text = self.pending_text[:heard_chars].rstrip()
        discarded_text = self.pending_text[heard_chars:].strip()
        return {
            "history_to_keep": [{"role": "assistant", "content": spoken_text}],
            "discarded_generated_text": discarded_text,
        }

    def _chars_heard(self, interrupt_ms: int) -> int:
        # Chunks are in playback order; stop at the first one still playing.
        heard_chars = 0
        for chunk in self.played_chunks:
            if chunk.playback_end_ms <= interrupt_ms:
                heard_chars = chunk.text_end_char
            else:
                break
        return heard_chars

pending = "Your order is delayed because the carrier missed pickup."
chunks = [
    PlayedChunk("Your order", 10, 240),
    PlayedChunk(" is delayed", 21, 520),
    PlayedChunk(" because the carrier", 41, 860),
    PlayedChunk(" missed pickup.", 56, 1120),
]

result = InterruptionHandler(pending, chunks).reconcile(interrupt_ms=650)
print(json.dumps(result, indent=2))

Output

{
  "history_to_keep": [
    {
      "role": "assistant",
      "content": "Your order is delayed"
    }
  ],
  "discarded_generated_text": "because the carrier missed pickup."
}

In production, map synthesized chunks to actual playout timestamps or RTP sequence ranges, not to model tokens alone. Jitter buffers can delay or drop chunks after synthesis, so "generated" is not the same as "heard."

Cascaded pipeline vs. native audio

Design reviews often ask you to compare two design families: cascaded STT -> LLM -> TTS pipelines and native speech-to-speech models. Early research systems such as AudioPaLM and SpeechGPT showed the idea was viable.[2][3] As of May 28, 2026, native audio is a production API option: OpenAI documents gpt-realtime for realtime audio input and output,[11] and Google documents Gemini Live native-audio sessions with voice activity detection and tool use.[12] Native models reduce model handoffs and can retain prosodic signals that a text-only boundary loses, but they do not remove transport, VAD, AEC, or state-management problems.

[Figure: Cascaded STT to LLM to TTS pipeline compared with native speech-to-speech models.]
Cascaded pipelines expose control points; native audio removes some model boundaries but requires separate evaluation of prosody, latency, and policy behavior.

Use the comparison as a control checklist. Native audio can shorten the model path, but you still need stateful sessions, transcript side channels, tool policy, and interruption handling.

Many commercial native-audio APIs use stateful sessions, not one-shot request/response calls. The client streams audio frames in, then receives incremental audio and transcript events back.

Session shape

session-shape.py
from collections import defaultdict
import json

events = [
    {"type": "transcript_delta", "text": "Your order"},
    {"type": "audio_delta", "audio_ms": 160},
    {"type": "transcript_delta", "text": " is delayed."},
    {"type": "audio_delta", "audio_ms": 220},
    {"type": "tool_call", "name": "send_tracking_link"},
]

state = defaultdict(list)
played_audio_ms = 0

for event in events:
    if event["type"] == "audio_delta":
        # This demo counts audio as played on receipt; a real client would
        # advance this cursor only when playout is confirmed.
        played_audio_ms += event["audio_ms"]
    elif event["type"] == "transcript_delta":
        state["transcript"].append(event["text"])
    elif event["type"] == "tool_call":
        state["tool_calls"].append(event["name"])

print(json.dumps({
    "transcript_side_channel": "".join(state["transcript"]),
    "played_audio_ms": played_audio_ms,
    "tool_calls": state["tool_calls"],
}, indent=2))

Output

{
  "transcript_side_channel": "Your order is delayed.",
  "played_audio_ms": 380,
  "tool_calls": [
    "send_tracking_link"
  ]
}

Audio tokenization

Many native audio systems avoid feeding raw PCM directly into the transformer. They first compress audio into acoustic tokens or other learned latent representations, then run the sequence model over those shorter units.

Generation works in reverse: the model emits audio tokens or latents, a decoder reconstructs waveform chunks, and the client plays them with a jitter buffer. System-design point: compressed speech representations make sequence modeling practical while preserving tone, pacing, and other paralinguistic cues.
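
A quick calculation shows why the compressed representation matters for sequence length. The sample rate, codec frame rate, and tokens-per-frame values below are hypothetical round numbers chosen for illustration; real codecs differ, and some models do not flatten per-frame tokens into one sequence at all.

```python
SAMPLE_RATE_HZ = 16_000    # assumed PCM sample rate
FRAME_RATE_HZ = 50         # assumed codec frame rate (hypothetical)
TOKENS_PER_FRAME = 4       # assumed tokens emitted per frame (hypothetical)


def sequence_lengths(seconds: float) -> dict[str, int]:
    """Compare raw-sample count with token count for the same audio duration."""
    raw_samples = int(seconds * SAMPLE_RATE_HZ)
    audio_tokens = int(seconds * FRAME_RATE_HZ * TOKENS_PER_FRAME)
    return {
        "raw_samples": raw_samples,
        "audio_tokens": audio_tokens,
        "reduction_factor": raw_samples // audio_tokens,
    }


print(sequence_lengths(10))
# {'raw_samples': 160000, 'audio_tokens': 2000, 'reduction_factor': 80}
```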

Trade-offs

Native audio can avoid some model handoffs and retain paralinguistic features such as tone, hesitation, and emphasis that a strict text bottleneck removes. Whether it produces faster or better interactions is an evaluation result, not an architectural guarantee.

However, native audio is harder to debug and harder to control. Without a clean intermediate transcript, it's tougher to inspect model failure modes, run deterministic tool policies, or align partially spoken output with conversation history. Modular pipelines are still easier to observe, test, and swap component-by-component.

Moving audio across the internet

When moving live audio across the internet, transport choice sets baseline latency and jitter. For browser and mobile voice agents, WebRTC prefers a media-timed UDP path when available, but ICE/TURN deployments need relay and TCP/TLS fallback paths for restricted networks.[4] WebSockets (persistent, full-duplex TCP connections) remain useful for signaling, transcription-only streams, or server-to-server audio when you are willing to manage media buffering yourself.

TCP guarantees reliable, in-order delivery by retransmitting any segment that is not acknowledged. That's great for text and file transfer, but it creates head-of-line blocking for live audio. If one packet is lost, later packets wait behind the retransmission. In voice, that often shows up as stuttery playout or latency spikes in the hundreds of milliseconds.

On a WebRTC UDP media path, loss handling is deadline-aware: the stream can keep moving while packet-loss concealment fills short gaps. For conversation, concealing a short lost frame may be preferable to delaying later audio for recovery; verify this trade-off under the network slices you serve.

[Figure: WebRTC versus WebSocket transport comparison under packet loss for live audio.]
Live audio prefers timed playout over perfect byte recovery.

The difference is a deadline. RTP recovery helps only while a packet can still improve playout; TCP recovery must preserve byte order even if the voice moment has already passed.
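
A toy simulation makes the deadline difference concrete. It models one lost packet in a 20ms-per-frame stream. The ordered path holds back every later packet until the retransmission lands, while the deadline path conceals the missing frame and keeps playing. The transit time, retransmission delay, and buffer depth are assumed values, and the model ignores NACK/FEC recovery and real congestion control.

```python
FRAME_MS = 20        # one audio packet every 20 ms (assumed)
ONE_WAY_MS = 40      # assumed network transit time
RETRANSMIT_MS = 200  # assumed extra delay before a lost packet is recovered
LOST = {3}           # packet 3 is lost on its first transmission


def ordered_stream_delivery(n: int) -> list[int]:
    """TCP-like: a packet reaches the app only after all earlier packets."""
    delivered, latest = [], 0
    for i in range(n):
        arrival = i * FRAME_MS + ONE_WAY_MS + (RETRANSMIT_MS if i in LOST else 0)
        latest = max(latest, arrival)  # later packets wait behind the gap
        delivered.append(latest)
    return delivered


def deadline_playout(n: int, buffer_ms: int = 40) -> list[str]:
    """RTP-like: each packet has a playout deadline; misses are concealed."""
    decisions = []
    for i in range(n):
        deadline = i * FRAME_MS + ONE_WAY_MS + buffer_ms
        arrival = None if i in LOST else i * FRAME_MS + ONE_WAY_MS
        on_time = arrival is not None and arrival <= deadline
        decisions.append("play" if on_time else "conceal")
    return decisions


delivered = ordered_stream_delivery(8)
added_wait = [t - (i * FRAME_MS + ONE_WAY_MS) for i, t in enumerate(delivered)]

print("ordered delivery (ms):", delivered)
# [40, 60, 80, 300, 300, 300, 300, 300]
print("added wait per packet (ms):", added_wait)
# [0, 0, 0, 200, 180, 160, 140, 120]
print("deadline playout:", deadline_playout(8))
# ['play', 'play', 'play', 'conceal', 'play', 'play', 'play', 'play']
```

In the ordered case, four packets that arrived on time still reach the application 120 to 180ms late; in the deadline case, one 20ms frame is concealed and nothing else moves.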

Jitter buffers and packet loss concealment

Network jitter (variance in packet arrival time) is the enemy of smooth audio. A jitter buffer on the receiver side (media server or client) holds incoming packets for a short duration before playback. Around 20-60ms is a common starting range, but the right value depends on network quality and how much delay your UX can tolerate.

  • Static Jitter Buffer: Fixed size (e.g., 50ms). Simple but risks underruns if jitter exceeds buffer size.
  • Adaptive Jitter Buffer: Dynamically resizes based on network conditions. Expands during congestion (adding latency) and shrinks during stability (reducing latency).

Browser WebRTC stacks already include adaptive jitter buffers and packet loss concealment (PLC). NetEQ is a well-known WebRTC receiver implementation of this idea.[13] When packets arrive late or disappear, the receiver can stretch buffered audio, extrapolate from recent audio, or ask the decoder to synthesize concealment frames instead of stalling playout.
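
To illustrate the adaptive idea only, the sketch below combines the running interarrival-jitter estimate defined for RTP in RFC 3550 (a smoothed average with 1/16 gain) with a hypothetical sizing policy: a floor plus a multiple of the jitter estimate, clamped to a ceiling. The policy, its constants, and the function names are assumptions for illustration; this is not how NetEQ sizes its buffer.

```python
def update_jitter(jitter_ms: float, prev_transit_ms: float, transit_ms: float) -> float:
    """RFC 3550-style running jitter estimate with 1/16 gain."""
    return jitter_ms + (abs(transit_ms - prev_transit_ms) - jitter_ms) / 16


def target_delay_ms(jitter_ms: float, floor_ms: float = 20.0,
                    ceiling_ms: float = 120.0, multiplier: float = 3.0) -> float:
    """Hypothetical policy: floor plus a multiple of jitter, clamped to a ceiling."""
    return min(ceiling_ms, floor_ms + multiplier * jitter_ms)


def simulate(transits_ms: list[float]) -> float:
    """Feed per-packet transit times through the estimator and size the buffer."""
    jitter = 0.0
    for prev, now in zip(transits_ms, transits_ms[1:]):
        jitter = update_jitter(jitter, prev, now)
    return target_delay_ms(jitter)


stable = simulate([40.0] * 33)                     # constant transit time
congested = simulate([40.0, 100.0] * 16 + [40.0])  # transit swings by 60 ms

print("stable network target:", stable)        # 20.0
print("congested network target:", congested)  # 120.0 (clamped at the ceiling)
```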

Protocol comparison: WebRTC vs. WebSocket

| Feature | WebRTC Media (UDP Preferred, Fallbacks Possible) | WebSocket (TCP) |
| --- | --- | --- |
| Transport | UDP | TCP |
| Reliability Model | Media-timed (may use NACK/FEC, but playout stays time-bounded) | Byte-stream reliable (lost bytes block later bytes) |
| Latency Behavior | Low and stable when tuned | Can spike under loss |
| Congestion Control | Media-aware adaptation via RTP/RTCP feedback (implementation-dependent) | Managed by the TCP stack, not by media playout deadlines |
| Encryption | Mandatory: DTLS (Datagram Transport Layer Security) / SRTP (Secure Real-time Transport Protocol) | TLS (over TCP) |
| Ideal Use Case | Real-time Voice/Video | Chat, Signaling, File Transfer |

WebRTC media is richer than "UDP means drop packets." On UDP media paths it layers RTP (Real-time Transport Protocol) / RTCP (RTP Control Protocol) feedback, jitter buffers, packet loss concealment (PLC), and sometimes NACK (Negative Acknowledgment) or FEC (Forward Error Correction). ICE and TURN can select relay or fallback paths when direct UDP is unavailable.[4] The design goal is timeliness: recover loss when it can still help, otherwise keep audio moving.

Using WebSockets as a drop-in replacement for WebRTC in browser live audio creates hidden work. You can ship with WebSockets, but you must own chunking, playout buffering, and backpressure, and you accept worse behavior under packet loss.

Scaling to thousands of conversations

Designing voice AI for scale introduces challenges around state, media routing, and bursty GPU demand. For a system serving 10,000+ concurrent sessions, the ingress tier usually needs sticky routing or consistent hashing so packets for one conversation keep landing on the same media worker.

[Figure: Scaled voice session architecture showing sticky routing to media workers and independent inference pools.]
Stateful media workers need sticky routing; inference pools can scale independently.

The media tier scales by active sessions and network state. The inference tier scales by model pressure, which is why separating these tiers usually beats one giant "voice server" process.

WebRTC connections are deeply stateful. A specific media worker holds Datagram Transport Layer Security (DTLS) keys, Secure Real-time Transport Protocol (SRTP) state, jitter-buffer state, and often a playback cursor. If a naive round-robin load balancer routes packets to a different worker mid-session, decryption and timing state break immediately.

Behind media workers, inference components should scale independently because their compute profiles differ. Turn detection and some STT stages often fit on CPU. The largest speech models and LLMs usually need GPUs. Independent scaling keeps GPU utilization high without overprovisioning lighter tiers.

keeping-media-sessions-affine.py
import hashlib
import json

workers = ["media-a", "media-b", "media-c"]

def route(session_id: str) -> str:
    # Modulo hashing is sticky only while the worker list stays the same;
    # a consistent-hash ring limits remapping when workers join or leave.
    digest = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return workers[digest % len(workers)]

session_id = "call-alex-4219"
packet_routes = [route(session_id) for _ in range(4)]
failed_worker = packet_routes[0]

print(json.dumps({
    "packet_routes": packet_routes,
    "single_worker_during_session": len(set(packet_routes)) == 1,
    "failed_worker": failed_worker,
    "recovery": "reconnect_and_rebuild_session",
}, indent=2))

Output

{
  "packet_routes": [
    "media-c",
    "media-c",
    "media-c",
    "media-c"
  ],
  "single_worker_during_session": true,
  "failed_worker": "media-c",
  "recovery": "reconnect_and_rebuild_session"
}
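
The example above hashes a session ID modulo the worker count, which keeps one session on one worker but reshuffles many sessions whenever the worker list changes. A minimal consistent-hash ring sketch follows, showing the property the section relies on: when one worker is removed, only that worker's sessions move. The HashRing class, the virtual-node count, and the session IDs are illustrative assumptions, not a production router.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes per worker."""

    def __init__(self, workers: list[str], vnodes: int = 64):
        # Each worker owns many points on the ring to even out the load.
        self._ring = sorted((_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def route(self, session_id: str) -> str:
        # Walk clockwise to the first point at or past the session's hash.
        idx = bisect.bisect(self._points, _hash(session_id)) % len(self._ring)
        return self._ring[idx][1]


sessions = [f"call-{n}" for n in range(1000)]
before = HashRing(["media-a", "media-b", "media-c"])
after = HashRing(["media-a", "media-b"])  # media-c has failed

moved = [s for s in sessions if before.route(s) != after.route(s)]
only_failed_worker_moved = all(before.route(s) == "media-c" for s in moved)

print("sessions moved:", len(moved), "of", len(sessions))
print("only media-c sessions moved:", only_failed_worker_moved)  # True
```

Sessions that do move still need the reconnect-and-rebuild path, because the DTLS and SRTP state on the failed worker is gone.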

Mastery check

At this point you should be able to defend the full live loop: how first audio arrives fast, how barge-in trims history to heard audio, and why media routing must stay stateful even when model pools scale independently.

Common pitfalls

These are the failure patterns that most often break the voice illusion. In a design review, name the symptom, cause, and fix instead of blaming the model generically.

Symptom: The agent pauses for one or two seconds before it even starts thinking. Cause: You batch-transcribed the whole utterance instead of using streaming STT partials. Fix: Emit stable early text so the LLM can pre-read the turn before the final transcript lands.

Symptom: The answer is correct, but the long silence before speech makes the system feel dead. Cause: You waited for full LLM completion before sending anything to TTS. Fix: Buffer text into short speakable clauses and flush them as soon as they are stable enough to say aloud.

Symptom: The agent speaks an incorrect order number or starts an action the caller immediately corrects. Cause: You treated an unstable STT partial as a committed turn. Fix: Limit partial-transcript work to reversible preparation; authorize speech and side effects only after turn commitment and required confirmation.

Symptom: Slow speakers get clipped, or every caller feels like they are talking to a laggy robot. Cause: Endpointing is tuned backward for the call type. Fix: Tune silence thresholds and semantic end-of-turn rules against false cutoffs, barge-in latency, and time-to-first-audio.

Symptom: After an interruption, the agent answers as if the user heard words that never played. Cause: You stored generated text instead of playback-confirmed text. Fix: Align synthesized chunks to playout timestamps or RTP ranges and keep only the heard prefix in history.

Symptom: Browser audio works in demos but stalls badly when the network drops packets. Cause: You treated WebSockets as a drop-in replacement for WebRTC media. Fix: Prefer WebRTC for live browser audio, or explicitly own buffering, backpressure, and loss behavior yourself.

Symptom: The agent sounds clever after several seconds, but conversation feels broken. Cause: You put deep reasoning or heavy tool latency directly in the live voice loop. Fix: Keep the live model fast and push expensive reasoning into async tools, follow-up workflows, or background actions.

Follow-up questions

A reader should be able to answer follow-up questions from the architecture they just built up; the rubric below describes what a good answer covers at each level.

Evaluation rubric

  • Foundational: Explain an example 500ms first-audio objective across endpointing, STT, LLM, TTS, and playout, then name what must be measured before shipping.
  • Intermediate: Draw the streaming path from mic capture to speaker playout and identify where stable partials allow overlap.
  • Intermediate: Show how barge-in cancels work, trims history to heard audio, and avoids replaying unheard text into the next turn.
  • Advanced: Defend WebRTC over WebSockets for browser live audio using deadline-based playout, jitter buffers, and head-of-line blocking.
  • Advanced: Choose between cascaded and native audio architectures based on observability, tool control, prosody, and failure recovery.
  • Advanced: Justify session scaling with sticky media routing and independently scaled inference pools.

What to build next

The best way to internalize this architecture is to build something small and measurable. Here are three progressive projects you can add to a portfolio:

  1. Voice Mirror: A simple agent that transcribes what you say and speaks it back in a different voice. This teaches you how to wire VAD, STT, and TTS together without the complexity of LLM reasoning.

  2. Smart Store Controller: A voice agent that can look up mock order status and initiate returns via function calls. This adds tool use and teaches you how latency changes when the LLM must call an API before responding.

  3. Roleplay Support Agent: A voice agent that uses a clause buffer and records TTFA distributions while giving multi-sentence responses about delivery exceptions or refund policies. This teaches you to tune silence_duration_ms, clause flushing, and barge-in handling against a declared objective.
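
For the third project, a TTFA distribution can be summarized with a few lines of code. The sketch below uses a nearest-rank percentile and a handful of made-up samples. The sample values, the 500ms objective, and the function name are assumptions for illustration; a real run needs far more samples per network slice.

```python
import math


def percentile(samples: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical TTFA samples in milliseconds from ten test turns.
ttfa_ms = [310, 340, 355, 370, 390, 410, 430, 460, 520, 780]
OBJECTIVE_MS = 500  # assumed declared objective

summary = {
    "p50_ms": percentile(ttfa_ms, 50),
    "p95_ms": percentile(ttfa_ms, 95),
    "share_over_objective": sum(ms > OBJECTIVE_MS for ms in ttfa_ms) / len(ttfa_ms),
}
print(summary)
# {'p50_ms': 390, 'p95_ms': 780, 'share_over_objective': 0.2}
```

Here the median looks healthy while the tail misses the objective, which is the reason to report the distribution and not a single average.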

Key takeaways and next step

Building a voice agent with a tight first-audio objective requires measuring every handoff across the processing pipeline. The architecture must be resilient to network jitter while handling the unpredictability of human conversation.

  • Measured latency objective: Endpointing, STT, LLM, TTS, tools, and playout all consume or overlap within TTFA; slow work needs acknowledgement or asynchronous handling.
  • Stream everything: STT streams partials, the LLM streams tokens, and TTS streams audio chunks.
  • Interruption handling: This is a primary UX differentiator. Systems that don't stop speaking immediately feel robotic and rude.
  • WebRTC is the live client-media default: Prefer its timed media path and plan for ICE/TURN fallback; WebSockets still fit signaling or managed server-side streaming.
  • Native audio is a real option: Modular pipelines remain easier to debug and control, while native paths should be chosen only after measuring latency, prosody, policy, and state behavior.

You should now be able to diagram the lifecycle of a single voice packet from microphone to model and back, calculate the cumulative latency of a pipeline and identify the bottleneck, implement basic barge-in logic, and discuss the trade-offs between cascaded and native speech-to-speech models.

Next Step
Continue to Reasoning & Test-Time Compute

Voice agents showed you the tight real-time streaming and barge-in constraints of conversational interfaces. The final capstone takes those same serving fundamentals (KV cache, batching, latency budgets) and applies them to reasoning agents that deliberately spend more compute at inference time, using difficulty routing, Best-of-N, tree search, and process reward models, to solve the hardest problems the earlier phases prepared you for.

References

[1] Universals and cultural variation in turn-taking in conversation. Stivers, T., et al. · 2009 · PNAS
[2] AudioPaLM: A Large Language Model That Can Speak and Listen. Rubenstein, P. K., et al. · 2023 · arXiv preprint
[3] SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Zhang, D., et al. · 2023 · arXiv preprint
[4] Media Transport and Use of RTP in WebRTC. Perkins, C., Westerlund, M., & Ott, J. · 2021 · IETF RFC 8834
[5] Turn Detection for Voice Agents: VAD, Endpointing, and Model-Based Detection. Hall, J. (LiveKit) · 2026
[6] Silero VAD: pre-trained enterprise-grade Voice Activity Detector. Silero Team · 2021
[7] Media Capture and Streams. World Wide Web Consortium · 2026 · W3C Editor's Draft
[8] Whisper: Robust Speech Recognition via Large-Scale Weak Supervision. Radford, A., et al. · 2022 · arXiv preprint
[9] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
[10] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023
[11] Introducing gpt-realtime and Realtime API updates for production voice agents. OpenAI · 2025
[12] Gemini Live API overview. Google · 2026
[13] NetEq. WebRTC Project · 2026 · Chromium WebRTC documentation