Show HN: I built a sub-500ms latency voice agent from scratch

Published: March 2, 2026 at 04:23 PM EST

Source: Hacker News

Overview

I built a voice agent from scratch that averages ~400 ms end‑to‑end latency, measured from the caller going silent to the first syllable of the reply. That’s with the full STT → LLM → TTS chain in the loop, clean barge‑ins, and no precomputed responses.

What moved the needle

  • Voice is a turn‑taking problem, not a transcription problem. Voice activity detection (VAD) alone fails; you need semantic end‑of‑turn detection.
  • The system reduces to one loop: speaking vs listening. The two transitions – cancel instantly on barge‑in, respond instantly on end‑of‑turn – define the experience.
  • STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.
  • Time to first token (TTFT) dominates everything. In voice, the first token is the critical path. Groq’s ~80 ms TTFT was the single biggest win.
  • Geography matters more than prompts. Co‑locate everything or you lose before you start.
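
The speaking/listening loop described above can be sketched as a two‑state machine. This is only an illustration, not the author's implementation: the names `TurnLoop`, `State`, `Event`, and the action strings `start_reply` and `cancel_reply` are hypothetical, and the events stand in for whatever the end‑of‑turn detector and audio layer actually emit.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class Event(Enum):
    # All three events are assumed signals, not a real library's API.
    END_OF_TURN = auto()  # semantic end-of-turn detector fired
    USER_SPEECH = auto()  # user audio detected
    REPLY_DONE = auto()   # agent finished playing its reply


@dataclass
class TurnLoop:
    state: State = State.LISTENING
    # Actions are recorded as strings here; a real agent would instead
    # start or cancel the streaming LLM and TTS tasks at these points.
    actions: list[str] = field(default_factory=list)

    def handle(self, event: Event) -> State:
        if self.state is State.LISTENING and event is Event.END_OF_TURN:
            # Transition 1: respond immediately on end-of-turn.
            self.actions.append("start_reply")
            self.state = State.SPEAKING
        elif self.state is State.SPEAKING and event is Event.USER_SPEECH:
            # Transition 2: cancel immediately on barge-in.
            self.actions.append("cancel_reply")
            self.state = State.LISTENING
        elif self.state is State.SPEAKING and event is Event.REPLY_DONE:
            # Reply played to completion; go back to listening.
            self.state = State.LISTENING
        # Any other (state, event) pair is ignored.
        return self.state
```

Feeding it `END_OF_TURN` followed by `USER_SPEECH` moves it to speaking and straight back to listening, recording one `start_reply` and one `cancel_reply`. Keeping the loop this small makes the two latency‑critical transitions easy to find and measure.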
