Show HN: I built a sub-500ms latency voice agent from scratch
Source: Hacker News
Overview
I built a voice agent from scratch that averages ~400 ms end‑to‑end latency, measured from the moment the caller stops speaking to the first syllable of the reply. That’s with full STT → LLM → TTS in the loop, clean barge‑ins, and no precomputed responses.
What moved the needle
- Voice is a turn‑taking problem, not a transcription problem. Voice activity detection (VAD) alone fails; you need semantic end‑of‑turn detection.
- The system reduces to one loop: speaking vs listening. The two transitions – cancel instantly on barge‑in, respond instantly on end‑of‑turn – define the experience.
- STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.
- TTFT dominates everything. In voice, the LLM’s time to first token (TTFT) is the critical path. Groq’s ~80 ms TTFT was the single biggest win.
- Geography matters more than prompts. Co‑locate everything or you lose before you start.
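To make the speaking/listening loop and its two transitions concrete, here is a minimal sketch in Python with asyncio. It is not the code from the project: `TurnLoop`, `fake_llm_stream`, and `make_responder` are hypothetical names, the LLM is a stub that yields canned tokens, and appending to a list stands in for handing each token to a streaming TTS. Under those assumptions, it shows a reply that starts on end‑of‑turn, streams token by token, and is cancelled as soon as a barge‑in arrives.

```python
import asyncio
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


class TurnLoop:
    """Two states, two transitions: end-of-turn starts a reply, barge-in cancels it."""

    def __init__(self, respond):
        # `respond` is a hypothetical coroutine function that streams one reply.
        self.respond = respond
        self.state = State.LISTENING
        self._reply = None  # asyncio.Task for the in-flight reply, if any

    async def on_end_of_turn(self, transcript):
        if self.state is State.SPEAKING:
            return  # already replying; ignore duplicate end-of-turn signals
        self.state = State.SPEAKING
        self._reply = asyncio.create_task(self._run(transcript))

    async def _run(self, transcript):
        try:
            await self.respond(transcript)
        finally:
            # Finished or cancelled, the agent goes back to listening.
            self.state = State.LISTENING

    async def on_barge_in(self):
        if self._reply is not None and not self._reply.done():
            self._reply.cancel()  # stop generating and speaking right away
            try:
                await self._reply
            except asyncio.CancelledError:
                pass
        self.state = State.LISTENING


async def fake_llm_stream(prompt, gap_s):
    """Stub for a streaming LLM: first token at once, then one every gap_s seconds."""
    for i, token in enumerate(["Sure,", " one", " moment."]):
        if i:
            await asyncio.sleep(gap_s)
        yield token


def make_responder(spoken, gap_s):
    async def respond(transcript):
        async for token in fake_llm_stream(transcript, gap_s):
            spoken.append(token)  # stands in for pushing the token to streaming TTS
    return respond


async def demo():
    spoken = []
    loop = TurnLoop(make_responder(spoken, gap_s=10))
    await loop.on_end_of_turn("hello")  # caller finished: start replying
    await asyncio.sleep(0.05)           # first token has been "spoken" by now
    await loop.on_barge_in()            # caller interrupts: cancel the rest
    return spoken, loop.state


if __name__ == "__main__":
    print(*asyncio.run(demo()))
```

In the demo, only the first token is spoken before the barge‑in cancels the reply task, and the loop ends in the listening state. A real agent would also need to flush audio that is already buffered for playback, which this sketch does not model.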
References
- Comments: Hacker News discussion (Points: 11, Comments: 3)