From 5 Seconds to 0.7 Seconds: How I Built a Production-Ready Voice AI Agent (And Cut Latency by 7x)
Source: Dev.to
The tl;dr for the Busy Dev
I built a production‑ready voice AI agent that went from 5+ seconds of latency to sub‑second responses through 8 systematic optimization phases. The journey wasn’t just about code—it was about understanding where bottlenecks hide and how simple changes can have massive impact.
The Stack
- LiveKit Agents SDK – Real‑time WebRTC infrastructure
- OpenAI – STT (Whisper → GPT‑4o‑mini‑transcribe) & LLM (GPT‑4o → GPT‑4o‑mini)
- ElevenLabs – Text‑to‑Speech synthesis
- Python 3.11 – Implementation language
The Results
- 🚀 7× faster – Total latency: 5.5 s → 0.7 s (best case)
- ⚡ 3‑8× LLM improvement – TTFT: 4.7 s → 0.4 s
- 💨 98 % STT improvement – Subsequent transcripts: 2.1 s → 0.026 s (near‑instant!)
- 💰 10× cost reduction – Switched from GPT‑4o to GPT‑4o‑mini
- 🧠 Context management – Automatic pruning prevents unbounded growth
- 🔧 MCP integration – Voice agent can now execute document operations via voice commands
Key Insight: Optimization is iterative. Each fix reveals the next bottleneck. Start with metrics, optimize based on data, and don’t be afraid to make “obvious” changes—they often have the biggest impact.
The Challenge: Building a Voice Agent That Doesn’t Feel Like a Robot
The initial implementation felt sluggish: users asked a question, waited > 5 seconds, and got a robotic response. While functional, the experience wasn’t natural.
Human Baseline
Research shows the average human response time is 236 ms after a partner finishes speaking, with a standard deviation of ≈ 520 ms. Most natural responses fall within ≈ 750 ms.
Goal
- Real‑time speech understanding
- Fast, intelligent LLM responses
- Natural‑sounding voice output
- Graceful handling of interruptions
- Continuous metrics for optimization
Target latency: ≈ 540 ms (theoretical best‑case for a voice‑agent pipeline, within one human standard deviation).

The Architecture: Pipeline vs. End‑to‑End
I chose a pipeline approach (STT → LLM → TTS) over speech‑to‑speech models.
Why the Pipeline?
- Fine‑grained control – Optimize each component independently
- Flexibility – Swap models/providers per stage
- Debugging – Inspect intermediate outputs (transcriptions, LLM responses)
- Cost optimization – Use different models based on requirements
- Production readiness – Better suited for real‑world applications
- Granular trade‑offs – Optimize STT for accuracy, LLM for speed, TTS for quality
Trade‑off: More complexity, but worth it for production use cases.
Practical Example
| Use case | Optimization priority |
|---|---|
| Restaurant bookings | LLM reasoning |
| Medical triage | STT accuracy |
This flexibility is critical when different applications have distinct latency budgets.
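To make that concrete, here's a rough sketch of how I think about per-stage configuration. The model names and pairings below are illustrative placeholders for this post, not results from my benchmarks:

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    stt_model: str   # which transcription model to run
    llm_model: str   # which reasoning model to run
    tts_voice: str   # which synthesis voice to use

# Illustrative pairings only: tune these against your own latency budget.
restaurant_bookings = StageConfig(
    stt_model="gpt-4o-mini-transcribe",  # fast transcription is fine here
    llm_model="gpt-4o",                  # spend the budget on reasoning
    tts_voice="default",
)

medical_triage = StageConfig(
    stt_model="whisper-1",               # spend the budget on transcription accuracy
    llm_model="gpt-4o-mini",
    tts_voice="default",
)
```

The point is that each stage is an independent knob; an end-to-end speech-to-speech model gives you one knob for everything.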

Phase 1: The Initial Implementation (The Baseline)
Initial Stack
| Component | Model / Service |
|---|---|
| STT | OpenAI Whisper‑1 (batch) |
| LLM | GPT‑4o (high quality, slow) |
| TTS | ElevenLabs |
| VAD | Silero (lightweight, open‑source) |
| Infra | LiveKit Cloud (WebRTC) |
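For reference, wiring up this baseline with the LiveKit Agents SDK looks roughly like the sketch below. It's a minimal example, not my exact agent code, and the class and parameter names (AgentSession, WorkerOptions, plugin constructors) assume a recent 1.x release of the SDK:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import elevenlabs, openai, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # One object per pipeline stage: VAD -> STT -> LLM -> TTS.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=openai.STT(model="whisper-1"),   # batch Whisper (the Phase 1 baseline)
        llm=openai.LLM(model="gpt-4o"),      # high quality, slow
        tts=elevenlabs.TTS(),
    )

    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```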
Why LiveKit?
LiveKit’s globally distributed mesh reduces network latency by 20‑50 % versus direct peer‑to‑peer connections. Features include real‑time network measurement, automatic audio compression (97 % size reduction), packet timestamping, and persistent stateful connections—essential for conversational agents.
Initial Performance
- Total latency: 3.9‑5.5 s (15‑20× slower than human average)
- LLM TTFT: 1.0‑4.7 s (50‑85 % of total latency) ⚠️
- STT duration: 0.5‑2.5 s (30‑40 % of latency)
- TTS TTFB: 0.2‑0.3 s (not a bottleneck)
- VAD: ≈ 20 ms (minimal)
The LLM was the primary bottleneck; a single slow response (4.7 s) broke the interaction flow.
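How did I know which stage to blame? LiveKit Agents emits per-stage metrics (STT duration, LLM TTFT, TTS TTFB, end-of-utterance delay) on every turn. Here's a minimal sketch of hooking into that, assuming the 1.x metrics API; event and field names may differ in older SDK versions:

```python
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics

def attach_latency_logging(session: AgentSession) -> metrics.UsageCollector:
    """Log per-stage latency for every conversational turn."""
    usage = metrics.UsageCollector()

    @session.on("metrics_collected")
    def _on_metrics(ev: MetricsCollectedEvent):
        metrics.log_metrics(ev.metrics)  # prints STT/LLM/TTS timings for this turn
        usage.collect(ev.metrics)        # accumulates usage for cost summaries

    return usage
```

Logging every turn is what surfaced the 4.7 s TTFT outliers instead of just the averages.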

Phase 2: The “Obvious” Fix That Changed Everything
The Discovery
Using GPT‑4o for every response was overkill. GPT‑4o‑mini delivers ~80 % of the quality at 10 % of the cost and is 3‑8× faster.
The Change
```python
from livekit.plugins import openai  # LiveKit's OpenAI plugin

# Before
llm = openai.LLM(model="gpt-4o")

# After
llm = openai.LLM(model="gpt-4o-mini")
```
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| LLM TTFT | 1.0‑4.7 s | 0.36‑0.59 s | 3‑8× faster |
| Tokens/sec | 4.5‑17.7 | 11.3‑32.3 | 2‑4× faster |
| Total latency | 2.3‑3.0 s | 1.2‑1.5 s | 1.6‑2× faster |
| Cost | — | — | 10× reduction |
| Consistency | Variable | Much more predictable | — |
Lesson: The “obvious” fix can be the most impactful. Measure first, then optimize based on data.
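If you want to sanity-check this kind of swap before touching the agent, a quick TTFT comparison against the OpenAI API is enough. This is a throwaway benchmark sketch, not the measurement harness used in the pipeline:

```python
import time
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def time_to_first_token(model: str, prompt: str) -> float:
    """Return seconds from request start to the first streamed content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

for model in ("gpt-4o", "gpt-4o-mini"):
    ttft = time_to_first_token(model, "Book me a table for two at 7pm.")
    print(f"{model}: {ttft:.2f}s TTFT")
```

Run it a few times; single samples are noisy, which is exactly why the in-pipeline metrics above matter.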

Phase 3: Unlocking Real‑Time STT Streaming
(Content continues with streaming implementation, batching removal, and further latency reductions.)