From 5 Seconds to 0.7 Seconds: How I Built a Production-Ready Voice AI Agent (And Cut Latency by 7x)
Source: Dev.to
The tl;dr for the Busy Dev
I built a production‑ready voice AI agent that went from 5+ seconds of latency to sub‑second responses through 8 systematic optimization phases. The journey wasn’t just about code—it was about understanding where bottlenecks hide and how simple changes can have massive impact.
The Stack
- LiveKit Agents SDK – Real‑time WebRTC infrastructure
- OpenAI – STT (Whisper → GPT‑4o‑mini‑transcribe) & LLM (GPT‑4o → GPT‑4o‑mini)
- ElevenLabs – Text‑to‑Speech synthesis
- Python 3.11 – Implementation language
The Results
- 🚀 7× faster – Total latency: 5.5 s → 0.7 s (best case)
- ⚡ 3‑8× LLM improvement – TTFT: 4.7 s → 0.4 s
- 💨 98 % STT improvement – Subsequent transcripts: 2.1 s → 0.026 s (near‑instant!)
- 💰 10× cost reduction – Switched from GPT‑4o to GPT‑4o‑mini
- 🧠 Context management – Automatic pruning prevents unbounded growth
- 🔧 MCP integration – Voice agent can now execute document operations via voice commands
Key Insight: Optimization is iterative. Each fix reveals the next bottleneck. Start with metrics, optimize based on data, and don’t be afraid to make “obvious” changes—they often have the biggest impact.
The Challenge: Building a Voice Agent That Doesn’t Feel Like a Robot
The initial implementation felt sluggish: users asked a question, waited > 5 seconds, and got a robotic response. While functional, the experience wasn’t natural.
Human Baseline
Research shows the average human response time is 236 ms after a partner finishes speaking, with a standard deviation of ≈ 520 ms. Most natural responses fall within ≈ 750 ms.
Goal
- Real‑time speech understanding
- Fast, intelligent LLM responses
- Natural‑sounding voice output
- Graceful handling of interruptions
- Continuous metrics for optimization
Target latency: ≈ 540 ms (theoretical best‑case for a voice‑agent pipeline, within one human standard deviation).

The Architecture: Pipeline vs. End‑to‑End
I chose a pipeline approach (STT → LLM → TTS) over speech‑to‑speech models.
Why the Pipeline?
- Fine‑grained control – Optimize each component independently
- Flexibility – Swap models/providers per stage
- Debugging – Inspect intermediate outputs (transcriptions, LLM responses)
- Cost optimization – Use different models based on requirements
- Production readiness – Better suited for real‑world applications
- Granular trade‑offs – Optimize STT for accuracy, LLM for speed, TTS for quality
Trade‑off: More complexity, but worth it for production use cases.
Practical Example
| Use case | Optimization priority |
|---|---|
| Restaurant bookings | LLM reasoning |
| Medical triage | STT accuracy |
This flexibility is critical when different applications have distinct latency budgets.
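To make that concrete, here's a rough sketch of how I think about per-stage configuration. The model names and pairings below are illustrative placeholders for this post, not results from my benchmarks:

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    stt_model: str   # which transcription model to run
    llm_model: str   # which reasoning model to run
    tts_voice: str   # which synthesis voice to use

# Illustrative pairings only: tune these against your own latency budget.
restaurant_bookings = StageConfig(
    stt_model="gpt-4o-mini-transcribe",  # fast transcription is fine here
    llm_model="gpt-4o",                  # spend the budget on reasoning
    tts_voice="default",
)

medical_triage = StageConfig(
    stt_model="whisper-1",               # spend the budget on transcription accuracy
    llm_model="gpt-4o-mini",
    tts_voice="default",
)
```

The point is that each stage is an independent knob; an end-to-end speech-to-speech model gives you one knob for everything.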

Phase 1: The Initial Implementation (The Baseline)
Initial Stack
| Component | Model / Service |
|---|---|
| STT | OpenAI Whisper‑1 (batch) |
| LLM | GPT‑4o (high quality, slow) |
| TTS | ElevenLabs |
| VAD | Silero (lightweight, open‑source) |
| Infra | LiveKit Cloud (WebRTC) |
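For reference, wiring up this baseline with the LiveKit Agents SDK looks roughly like the sketch below. It's a minimal example, not my exact agent code, and the class and parameter names (AgentSession, WorkerOptions, plugin constructors) assume a recent 1.x release of the SDK:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import elevenlabs, openai, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # One object per pipeline stage: VAD -> STT -> LLM -> TTS.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=openai.STT(model="whisper-1"),   # batch Whisper (the Phase 1 baseline)
        llm=openai.LLM(model="gpt-4o"),      # high quality, slow
        tts=elevenlabs.TTS(),
    )

    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```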
Why LiveKit?
LiveKit’s globally distributed mesh reduces network latency by 20‑50 % versus direct peer‑to‑peer connections. Features include real‑time network measurement, automatic audio compression (97 % size reduction), packet timestamping, and persistent stateful connections—essential for conversational agents.
Initial Performance
- Total latency: 3.9‑5.5 s (15‑20× slower than human average)
- LLM TTFT: 1.0‑4.7 s (50‑85 % of total latency) ⚠️
- STT duration: 0.5‑2.5 s (30‑40 % of latency)
- TTS TTFB: 0.2‑0.3 s (not a bottleneck)
- VAD: ≈ 20 ms (minimal)
The LLM was the primary bottleneck; a single slow response (4.7 s) broke the interaction flow.
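How did I know which stage to blame? LiveKit Agents emits per-stage metrics (STT duration, LLM TTFT, TTS TTFB, end-of-utterance delay) on every turn. Here's a minimal sketch of hooking into that, assuming the 1.x metrics API; event and field names may differ in older SDK versions:

```python
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics

def attach_latency_logging(session: AgentSession) -> metrics.UsageCollector:
    """Log per-stage latency for every conversational turn."""
    usage = metrics.UsageCollector()

    @session.on("metrics_collected")
    def _on_metrics(ev: MetricsCollectedEvent):
        metrics.log_metrics(ev.metrics)  # prints STT/LLM/TTS timings for this turn
        usage.collect(ev.metrics)        # accumulates usage for cost summaries

    return usage
```

Logging every turn is what surfaced the 4.7 s TTFT outliers instead of just the averages.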

Phase 2: The “Obvious” Fix That Changed Everything
The Discovery
Using GPT‑4o for every response was overkill. GPT‑4o‑mini delivers ~80 % of the quality at 10 % of the cost and is 3‑8× faster.
The Change
```python
from livekit.plugins import openai  # LiveKit's OpenAI plugin

# Before
llm = openai.LLM(model="gpt-4o")

# After
llm = openai.LLM(model="gpt-4o-mini")
```
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| LLM TTFT | 1.0‑4.7 s | 0.36‑0.59 s | 3‑8× faster |
| Tokens/sec | 4.5‑17.7 | 11.3‑32.3 | 2‑4× faster |
| Total latency | 2.3‑3.0 s | 1.2‑1.5 s | 1.6‑2× faster |
| Cost | — | — | 10× reduction |
| Consistency | Variable | Much more predictable | — |
Lesson: The “obvious” fix can be the most impactful. Measure first, then optimize based on data.
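If you want to sanity-check this kind of swap before touching the agent, a quick TTFT comparison against the OpenAI API is enough. This is a throwaway benchmark sketch, not the measurement harness used in the pipeline:

```python
import time
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def time_to_first_token(model: str, prompt: str) -> float:
    """Return seconds from request start to the first streamed content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

for model in ("gpt-4o", "gpt-4o-mini"):
    ttft = time_to_first_token(model, "Book me a table for two at 7pm.")
    print(f"{model}: {ttft:.2f}s TTFT")
```

Run it a few times; single samples are noisy, which is exactly why the in-pipeline metrics above matter.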

Phase 3: Unlocking Real‑Time STT Streaming
(Content continues with streaming implementation, batching removal, and further latency reductions.)