From 5 Seconds to 0.7 Seconds: How I Built a Production-Ready Voice AI Agent (And Cut Latency by 7x)

Published: December 1, 2025 at 11:15 PM EST
3 min read
Source: Dev.to

The tl;dr for the Busy Dev

I built a production‑ready voice AI agent that went from 5+ seconds of latency to sub‑second responses through 8 systematic optimization phases. The journey wasn’t just about code—it was about understanding where bottlenecks hide and how simple changes can have massive impact.

The Stack

  • LiveKit Agents SDK – Real‑time WebRTC infrastructure
  • OpenAI – STT (Whisper → GPT‑4o‑mini‑transcribe) & LLM (GPT‑4o → GPT‑4o‑mini)
  • ElevenLabs – Text‑to‑Speech synthesis
  • Python 3.11 – Implementation language

The Results

  • 🚀 7× faster – Total latency: 5.5 s → 0.7 s (best case)
  • 3‑8× LLM improvement – TTFT: 4.7 s → 0.4 s
  • 💨 98 % STT improvement – Subsequent transcripts: 2.1 s → 0.026 s (near‑instant!)
  • 💰 10× cost reduction – Switched from GPT‑4o to GPT‑4o‑mini
  • 🧠 Context management – Automatic pruning prevents unbounded growth
  • 🔧 MCP integration – Voice agent can now execute document operations via voice commands

Key Insight: Optimization is iterative. Each fix reveals the next bottleneck. Start with metrics, optimize based on data, and don’t be afraid to make “obvious” changes—they often have the biggest impact.
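The context-management point above can be sketched as a simple pruning pass over the chat history. The function and message shape below are illustrative, not the LiveKit Agents API:

```python
# Illustrative sketch of chat-context pruning: keep the system prompt
# plus the most recent N turns so the LLM prompt can't grow unbounded.
# The names here are hypothetical, not the LiveKit Agents API.

def prune_context(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep system messages plus the last `max_turns` user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-max_turns:]

history = [{"role": "system", "content": "You are a voice agent."}]
for i in range(50):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

pruned = prune_context(history, max_turns=10)
print(len(pruned))  # 11: the system prompt + the last 10 messages
```

The key property: prompt size (and therefore LLM latency and cost) stays bounded no matter how long the conversation runs.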

The Challenge: Building a Voice Agent That Doesn’t Feel Like a Robot

The initial implementation felt sluggish: users asked a question, waited > 5 seconds, and got a robotic response. While functional, the experience wasn’t natural.

Human Baseline

Research shows the average human response time is 236 ms after a partner finishes speaking, with a standard deviation of ≈ 520 ms. Most natural responses fall within ≈ 750 ms.

Goal

  • Real‑time speech understanding
  • Fast, intelligent LLM responses
  • Natural‑sounding voice output
  • Graceful handling of interruptions
  • Continuous metrics for optimization

Target latency: ≈ 540 ms (theoretical best‑case for a voice‑agent pipeline, within one human standard deviation).
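That 540 ms target decomposes into a rough per-stage budget. The per-stage numbers below are illustrative assumptions (not measurements from this project), chosen so the total lands on the theoretical best case:

```python
# Rough per-stage latency budget for a pipelined voice agent.
# The per-stage figures are illustrative assumptions, chosen so the
# total lands on the ~540 ms theoretical best case.
budget_ms = {
    "vad_endpointing": 60,   # detect that the user stopped speaking
    "stt_final": 80,         # final transcript after end of speech
    "llm_ttft": 250,         # time to first LLM token
    "tts_ttfb": 150,         # time to first synthesized audio byte
}

total = sum(budget_ms.values())
print(f"total budget: {total} ms")  # 540 ms
for stage, ms in budget_ms.items():
    print(f"{stage:>16}: {ms:>4} ms ({ms / total:.0%})")
```

Framing the target this way makes it obvious that no single stage can be allowed to eat the whole budget.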

[Figure: Voice latency benchmark]

The Architecture: Pipeline vs. End‑to‑End

I chose a pipeline approach (STT → LLM → TTS) over speech‑to‑speech models.

Why the Pipeline?

  • Fine‑grained control – Optimize each component independently
  • Flexibility – Swap models/providers per stage
  • Debugging – Inspect intermediate outputs (transcriptions, LLM responses)
  • Cost optimization – Use different models based on requirements
  • Production readiness – Better suited for real‑world applications
  • Granular trade‑offs – Optimize STT for accuracy, LLM for speed, TTS for quality

Trade‑off: More complexity, but worth it for production use cases.
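The swap-ability argument above can be made concrete: if each stage is just an interface, any provider can be replaced without touching the other two. The stub classes below stand in for real providers; none of this is the LiveKit Agents API.

```python
# Minimal sketch of why a pipeline makes stages swappable: each stage
# is an interface, so STT/LLM/TTS providers can be replaced
# independently. Stub implementations stand in for real providers.
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()     # pretend the audio *is* the text

class UppercaseLLM:
    def complete(self, prompt: str) -> str:
        return prompt.upper()     # pretend "reasoning"

class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

def run_pipeline(stt: STT, llm: LLM, tts: TTS, audio: bytes) -> bytes:
    # Each stage's output feeds the next; swap any one stage
    # without touching the other two.
    return tts.synthesize(llm.complete(stt.transcribe(audio)))

out = run_pipeline(EchoSTT(), UppercaseLLM(), BytesTTS(), b"book a table")
print(out)  # b'BOOK A TABLE'
```

This is exactly the property exploited later in the article: Phase 2 swaps only the LLM stage and leaves STT and TTS untouched.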

Practical Example

| Use case | Latency priority |
| --- | --- |
| Restaurant bookings | LLM reasoning |
| Medical triage | STT accuracy |

This flexibility is critical when different applications have distinct latency budgets.

[Figure: Pipeline vs. end‑to‑end]

Phase 1: The Initial Implementation (The Baseline)

Initial Stack

| Component | Model / Service |
| --- | --- |
| STT | OpenAI Whisper‑1 (batch) |
| LLM | GPT‑4o (high quality, slow) |
| TTS | ElevenLabs |
| VAD | Silero (lightweight, open‑source) |
| Infra | LiveKit Cloud (WebRTC) |

Why LiveKit?
LiveKit’s globally distributed mesh reduces network latency by 20‑50 % versus direct peer‑to‑peer connections. Features include real‑time network measurement, automatic audio compression (97 % size reduction), packet timestamping, and persistent stateful connections—essential for conversational agents.

Initial Performance

  • Total latency: 3.9‑5.5 s (15‑20× slower than human average)
  • LLM TTFT: 1.0‑4.7 s (50‑85 % of total latency) ⚠️
  • STT duration: 0.5‑2.5 s (30‑40 % of latency)
  • TTS TTFB: 0.2‑0.3 s (not a bottleneck)
  • VAD: ≈ 20 ms (minimal)

The LLM was the primary bottleneck; a single slow response (4.7 s) broke the interaction flow.
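A quick way to confirm the bottleneck is to compute each stage's share of end-to-end latency. The timings below combine the worst-case per-stage figures reported above (so the sum exceeds any single observed run):

```python
# Compute each stage's share of total latency to find the bottleneck.
# Timings are the worst-case per-stage figures from the baseline above,
# so their sum is higher than any single measured run.
stages_s = {
    "vad": 0.02,
    "stt": 2.5,
    "llm_ttft": 4.7,
    "tts_ttfb": 0.3,
}

total = sum(stages_s.values())
bottleneck = max(stages_s, key=stages_s.get)

print(f"total: {total:.2f} s")
for name, t in sorted(stages_s.items(), key=lambda kv: -kv[1]):
    print(f"{name:>9}: {t / total:.0%}")
print(f"bottleneck: {bottleneck}")  # llm_ttft
```

Ranking stages by share is what makes the next optimization obvious: attack the LLM first, leave TTS alone.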

[Figure: Baseline latency breakdown]

Phase 2: The “Obvious” Fix That Changed Everything

The Discovery

Using GPT‑4o for every response was overkill. GPT‑4o‑mini delivers ~80 % of the quality at 10 % of the cost and is 3‑8× faster.

The Change

```python
# Before
llm = openai.LLM(model="gpt-4o")

# After
llm = openai.LLM(model="gpt-4o-mini")
```
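TTFT gains like these are easy to verify yourself: time the gap between issuing the request and receiving the first streamed token. A stubbed stream stands in for the real streaming client here (e.g. an OpenAI chat completion with `stream=True`); the helper names are illustrative:

```python
# Measure time-to-first-token (TTFT) on a streaming response.
# `fake_stream` stands in for a real streaming LLM call; the
# measurement logic is the same either way.
import time
from typing import Iterator

def fake_stream(delay_s: float) -> Iterator[str]:
    time.sleep(delay_s)        # model "thinking" before the first token
    yield "Hello"
    yield ", world"

def measure_ttft(stream: Iterator[str]) -> tuple[float, str]:
    start = time.perf_counter()
    first = next(stream)               # block until the first token arrives
    ttft = time.perf_counter() - start
    return ttft, first + "".join(stream)

ttft, text = measure_ttft(fake_stream(delay_s=0.05))
print(f"TTFT: {ttft * 1000:.0f} ms, text: {text!r}")
```

Run the same harness against both models and the 3‑8× TTFT gap shows up directly, without waiting for full responses.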

Results

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| LLM TTFT | 1.0‑4.7 s | 0.36‑0.59 s | 3‑8× faster |
| Tokens/sec | 4.5‑17.7 | 11.3‑32.3 | 2‑4× faster |
| Total latency | 2.3‑3.0 s | 1.2‑1.5 s | 1.6‑2× faster |
| Cost | – | – | 10× reduction |
| Consistency | Variable | Much more predictable | – |

Lesson: The “obvious” fix can be the most impactful. Measure first, then optimize based on data.

[Figure: LLM performance comparison]

Phase 3: Unlocking Real‑Time STT Streaming

(Content continues with streaming implementation, batching removal, and further latency reductions.)
