VoxTube – Convert YouTube videos to audio with local TTS

Published: January 30, 2026 at 06:37 PM EST
Source: Dev.to

Problem

I kept queuing YouTube tutorials and talks but never watching them. Video demands attention in a way that audio doesn’t.

Solution

VoxTube extracts transcripts from YouTube videos and converts them to audio using high‑quality TTS, so I can “watch” YouTube during my commute, while cooking, and during workouts.

Technical details

  • Built with Bun + Hono (~300 lines)
  • Uses Kokoro TTS (runs locally via Docker)
  • Caches generated audio
  • No cloud dependencies
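The caching step in the list above could look roughly like the sketch below. It is not VoxTube's actual code: the function names, the one-file-per-video cache layout, and the `.mp3` extension are all assumptions made for illustration, and `synthesize` is a hypothetical stand-in for whatever call reaches the local Kokoro container. It uses only `node:fs` and `node:path`, which Bun also supports.

```typescript
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical stand-in for the call to the local TTS container.
type Synthesize = (text: string) => Promise<Uint8Array>;

// Pull the 11-character video ID out of two common YouTube URL shapes.
// Returns null for anything that does not look like a YouTube video URL.
function extractVideoId(input: string): string | null {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return null;
  }
  const host = url.hostname.replace(/^www\./, "");
  let id: string | null = null;
  if (host === "youtu.be") {
    id = url.pathname.slice(1);
  } else if (host === "youtube.com" || host === "m.youtube.com") {
    id = url.searchParams.get("v");
  }
  return id !== null && /^[\w-]{11}$/.test(id) ? id : null;
}

// Assumed cache layout: one audio file per video ID.
function cachePath(cacheDir: string, videoId: string): string {
  return join(cacheDir, `${videoId}.mp3`);
}

// Return cached audio when present; otherwise synthesize once and store it.
async function getOrCreateAudio(
  cacheDir: string,
  videoId: string,
  transcript: string,
  synthesize: Synthesize,
): Promise<{ audio: Uint8Array; cached: boolean }> {
  const path = cachePath(cacheDir, videoId);
  if (existsSync(path)) {
    return { audio: readFileSync(path), cached: true };
  }
  const audio = await synthesize(transcript);
  mkdirSync(cacheDir, { recursive: true });
  writeFileSync(path, audio);
  return { audio, cached: false };
}
```

Passing `synthesize` in as a parameter keeps the cache logic independent of the TTS backend, so the same function works whether the audio comes from a container on localhost or from a test stub.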

What I learned

  • Bun’s file APIs are really nice for streaming audio.
  • Modern TTS (Kokoro) sounds surprisingly natural.
  • Most YouTube videos have transcripts available.
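Transcripts usually arrive as many short caption fragments rather than clean prose, so some preparation is likely needed before the text goes to a TTS engine. The sketch below shows one way to do that, under stated assumptions: the `Segment` shape is hypothetical (real transcript libraries differ), the bracketed-cue filter and the sentence-based chunking are illustrative choices, and nothing here is taken from VoxTube's source.

```typescript
// Hypothetical shape of one caption segment; real transcript libraries differ.
interface Segment {
  text: string;
  start: number; // offset in seconds
}

// Join caption fragments into plain text, dropping bracketed cues like "[Music]".
function segmentsToText(segments: Segment[]): string {
  return segments
    .map((s) => s.text.replace(/\[[^\]]*\]/g, " "))
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
}

// Split text into chunks of at most maxChars, breaking at sentence ends.
// A single sentence longer than maxChars is kept whole, so a chunk can
// exceed the limit in that case.
function chunkText(text: string, maxChars: number): string[] {
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = (current + sentence).trimEnd();
    if (current.length > 0 && candidate.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }
  return chunks;
}

const text = segmentsToText([
  { text: "Hello world.", start: 0 },
  { text: "[Music]", start: 2 },
  { text: "This is  a test.", start: 3 },
  { text: "Done!", start: 5 },
]);
console.log(chunkText(text, 30));
// prints [ 'Hello world. This is a test.', 'Done!' ]
```

Breaking on sentence boundaries rather than at a fixed character offset avoids cutting a word or phrase in half, which tends to produce audible glitches where two synthesized chunks are joined.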

Stats

  • 2 weeks to MVP
  • ~300 lines of code
  • $0 monthly costs (runs locally)

GitHub:
