Show HN: I built a sub-500ms latency voice agent from scratch

Published: 23 hours ago (March 2, 2026 at 04:23 PM EST)


Source: Hacker News

Overview

I built a voice agent from scratch that averages ~400 ms end‑to‑end latency (phone stop → first syllable). That’s with full STT → LLM → TTS in the loop, clean barge‑ins, and no precomputed responses.

What moved the needle

Voice is a turn‑taking problem, not a transcription problem. Voice activity detection (VAD) alone fails; you need semantic end‑of‑turn detection.
The system reduces to one loop: speaking vs listening. The two transitions – cancel instantly on barge‑in, respond instantly on end‑of‑turn – define the experience.
STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.
Time to first token (TTFT) dominates everything. In voice, the first token is the critical path. Groq’s ~80 ms TTFT was the single biggest win.
Geography matters more than prompts. Co‑locate everything or you lose before you start.
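
To make the "one loop, two transitions" point concrete, here is a minimal sketch in Python's asyncio. It is not the author's implementation: TurnLoop, on_end_of_turn, on_barge_in and fake_respond are hypothetical names, and fake_respond merely stands in for a streaming STT → LLM → TTS pipeline. It only assumes that an end-of-turn detector and a barge-in detector exist somewhere upstream and call these two methods.

```python
import asyncio
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


class TurnLoop:
    """Two-state turn-taking loop. All names here are hypothetical."""

    def __init__(self, respond):
        # respond(transcript) is an async callable standing in for the
        # streaming STT -> LLM -> TTS pipeline (an assumption of this sketch).
        self.respond = respond
        self.state = State.LISTENING
        self._reply = None  # asyncio.Task for the in-flight reply, if any

    def on_end_of_turn(self, transcript):
        """The user finished their turn: start replying immediately."""
        if self.state is State.LISTENING:
            self.state = State.SPEAKING
            self._reply = asyncio.create_task(self._speak(transcript))

    def on_barge_in(self):
        """The user spoke over the agent: cancel the reply right away."""
        if self.state is State.SPEAKING and self._reply is not None:
            self._reply.cancel()
            self._reply = None
            self.state = State.LISTENING

    async def _speak(self, transcript):
        try:
            await self.respond(transcript)
        finally:
            # Only reset if this task is still the current reply; a reply
            # cancelled by barge-in has already been detached above.
            if self._reply is asyncio.current_task():
                self._reply = None
                self.state = State.LISTENING


async def demo():
    spoken = []

    async def fake_respond(transcript):
        # Pretend each word is one streamed audio chunk.
        for word in ("one", "two", "three", "four"):
            spoken.append(word)
            await asyncio.sleep(0)  # yield to the event loop between chunks

    loop = TurnLoop(fake_respond)
    loop.on_end_of_turn("hello")   # LISTENING -> SPEAKING
    await asyncio.sleep(0)         # let the reply start streaming
    loop.on_barge_in()             # SPEAKING -> LISTENING, reply cancelled
    await asyncio.sleep(0)         # let the cancelled task unwind
    await asyncio.sleep(0)
    return loop.state, spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design choice worth noting is that the reply runs as a single cancellable task, so barge-in is one cancel() call rather than a flag that each pipeline stage has to poll. A real system would also need to flush audio already queued for playback, which this sketch leaves out.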

References

Comments: Hacker News discussion (Points: 11, Comments: 3)
