Building AI Video Transcription with OpenAI Whisper

Published: March 1, 2026 at 02:03 PM EST
4 min read
Source: Dev.to

Why Server-Side Transcription?

Most transcription tools either charge per minute of audio or require you to upload files to a third‑party API. I wanted something that runs on my own hardware, costs nothing per request, and integrates directly with the download pipeline.

OpenAI Whisper was the obvious choice. It’s open source, handles 90+ languages, and the accuracy on the large‑v3 model is genuinely impressive — even with background noise and accented speech.

The Architecture

The stack is straightforward:

  • Express 5 backend with Socket.IO for real‑time progress updates
  • yt‑dlp handles video downloading from YouTube, TikTok, Instagram, etc.
  • ffprobe extracts audio duration metadata
  • Whisper CLI runs the actual transcription

The flow:

URL → yt-dlp download → ffprobe (duration) → Whisper CLI → JSON segments

Spawning Whisper from Node.js

Whisper runs as a Python CLI tool, so I spawn it as a child process:

```js
import { spawn } from 'child_process';

const args = [
  audioPath,
  '--output_dir', transcriptsDir,
  '--output_format', 'json',
  '--model', 'base',
  '--verbose', 'True',
];

if (language !== 'auto') {
  args.push('--language', language);
}

const whisper = spawn('/path/to/whisper', args);
```
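To actually consume the result, the spawn can be wrapped in a promise that resolves once Whisper exits and the output file is readable. This is a sketch under my own assumptions: `runWhisper`, `buildWhisperArgs`, and `WHISPER_BIN` are names I've introduced, and the output-filename logic relies on Whisper's convention of naming the JSON after the input file.

```javascript
import { spawn } from 'child_process';
import { readFile } from 'fs/promises';
import path from 'path';

const WHISPER_BIN = '/path/to/whisper'; // adjust to your install

// Pure helper: assemble the CLI arguments shown above
function buildWhisperArgs(audioPath, transcriptsDir, language) {
  const args = [
    audioPath,
    '--output_dir', transcriptsDir,
    '--output_format', 'json',
    '--model', 'base',
    '--verbose', 'True',
  ];
  if (language !== 'auto') args.push('--language', language);
  return args;
}

// Resolve with the parsed transcript once Whisper exits cleanly
function runWhisper(audioPath, transcriptsDir, language) {
  return new Promise((resolve, reject) => {
    const whisper = spawn(WHISPER_BIN, buildWhisperArgs(audioPath, transcriptsDir, language));
    whisper.on('error', reject); // binary missing, permission denied, etc.
    whisper.on('close', async (code) => {
      if (code !== 0) return reject(new Error(`whisper exited with code ${code}`));
      // Whisper names the output after the input file, with a .json extension
      const base = path.basename(audioPath, path.extname(audioPath));
      const json = await readFile(path.join(transcriptsDir, `${base}.json`), 'utf8');
      resolve(JSON.parse(json));
    });
  });
}
```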

Whisper writes a JSON file with this structure:

```json
{
  "text": "full transcript text here",
  "segments": [
    { "start": 0.0, "end": 5.04, "text": "First sentence here" },
    { "start": 5.04, "end": 10.2, "text": "Second sentence" }
  ],
  "language": "en"
}
```

Each segment includes start/end timestamps — useful for building subtitle files or jump‑to‑timestamp features.
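The segments map almost directly onto the SRT subtitle format. A minimal converter might look like this (function names are mine; the SRT framing itself — index, `HH:MM:SS,mmm --> HH:MM:SS,mmm` range, text — is the standard format):

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, w = 2) => String(n).padStart(w, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Render Whisper's segments array as the body of an .srt file
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```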

The Progress Problem

Whisper doesn’t stream progress; it processes the whole file, then writes output. For a 3‑minute video that can mean 30+ seconds of silence while the user watches a spinner.

Work‑around: estimate completion time based on audio duration and send simulated progress updates over Socket.IO.

```js
// Rough heuristic: transcription takes ~2x the audio duration, with a 10s floor
const estimatedTime = Math.max(duration * 2, 10);
let currentProgress = 0;

// Tick interval scales with the estimate so longer files advance more slowly
const progressInterval = setInterval(() => {
  if (currentProgress < 95) {
    // Advance 2–7% per tick, capped at 95% until Whisper actually finishes
    currentProgress += Math.random() * 5 + 2;
    currentProgress = Math.min(currentProgress, 95);
    emitProgress({
      jobId,
      stage: 'transcribe',
      progress: Math.round(currentProgress),
      message: `Transcribing... ${Math.round(currentProgress)}%`,
    });
  }
}, estimatedTime * 10);
```

Is this elegant? No. Does it work? Users stop hitting refresh, so yes.
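The tick logic above can be factored into a pure function, plus a cleanup step for when the real job finishes — a sketch under my own naming (`nextProgress` and `finishProgress` are not from the original code):

```javascript
// Pure version of the per-tick increment: advance 2–7%, hold at 95%
function nextProgress(current) {
  if (current >= 95) return current; // wait for Whisper to actually exit
  return Math.min(current + Math.random() * 5 + 2, 95);
}

// When the Whisper process closes, stop simulating and emit the real 100%
function finishProgress(progressInterval, emitProgress, jobId) {
  clearInterval(progressInterval);
  emitProgress({
    jobId,
    stage: 'transcribe',
    progress: 100,
    message: 'Transcription complete',
  });
}
```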

Model Selection Trade‑offs

Whisper comes in several sizes. Here’s what I found in practice:

| Model | Speed (3 min audio) | Accuracy | VRAM |
| --- | --- | --- | --- |
| tiny | ~5 sec | Decent for clear speech | ~1 GB |
| base | ~10 sec | Solid for most content | ~1 GB |
| small | ~30 sec | Noticeably better | ~2 GB |
| medium | ~90 sec | Great | ~5 GB |
| large‑v3 | ~3 min | Best available | ~10 GB |

I default to base for the free tier: fast enough that users don’t abandon the page, accurate enough for most use cases. The tiny model occasionally garbles words, especially in non‑English content.

For production I’d recommend base as default, with an option to bump up to small or medium when accuracy matters more than speed.

Handling Failures

Whisper can fail in ways that aren’t immediately obvious:

  • Corrupted audio – yt-dlp sometimes produces files that ffmpeg can decode but Whisper chokes on. I added a pre‑check using ffprobe to validate the audio stream before sending it to Whisper.
  • Memory limits – The large‑v3 model needs ~10 GB VRAM. On a machine with 8 GB it silently falls back to CPU and takes ~10× longer. Set explicit timeouts or users will wait forever.
  • Language detection hiccups – Whisper’s auto‑detect usually works, but it can confuse similar languages (Ukrainian vs Russian, Spanish vs Portuguese). Letting users pick the language themselves fixes this.
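The ffprobe pre‑check mentioned above might look like this — a sketch, with `checkAudio` and `hasAudioStream` as names I've chosen; the ffprobe flags (`-show_streams`, `-print_format json`) are standard:

```javascript
import { spawn } from 'child_process';

// Pure helper: does ffprobe's JSON report at least one audio stream?
function hasAudioStream(ffprobeJson) {
  const info = JSON.parse(ffprobeJson);
  return Array.isArray(info.streams) &&
    info.streams.some((s) => s.codec_type === 'audio');
}

// Reject corrupt files before they ever reach Whisper
function checkAudio(audioPath) {
  return new Promise((resolve, reject) => {
    const ffprobe = spawn('ffprobe', [
      '-v', 'error',
      '-show_streams',
      '-print_format', 'json',
      audioPath,
    ]);
    let out = '';
    ffprobe.stdout.on('data', (chunk) => { out += chunk; });
    ffprobe.on('error', reject);
    ffprobe.on('close', (code) => {
      if (code !== 0) return reject(new Error('ffprobe failed: file is likely corrupt'));
      resolve(hasAudioStream(out));
    });
  });
}
```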

My timeout sits at 30 minutes – generous, but some long videos with the medium model genuinely need that much time on CPU.

What I’d Change Now

  • Use faster‑whisper. The CTranslate2‑based implementation is ~4× faster with the same accuracy. I started with the vanilla OpenAI CLI for simplicity but would switch for any serious deployment.
  • Add a job queue. Right now transcription runs inline. Two concurrent jobs on the same GPU will both slow to a crawl. Bull or BullMQ with proper concurrency limits would fix this.
  • Cache transcripts. The same video URL should return cached results instead of re‑processing. Storage is cheap; CPU time isn’t.
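The caching idea could be as simple as keying stored transcripts by a hash of the URL — a sketch under my own assumptions (`cacheKeyFor`, the cache directory layout, and the JSON-file storage are all mine):

```javascript
import { createHash } from 'crypto';
import { readFile, writeFile } from 'fs/promises';
import path from 'path';

const CACHE_DIR = './transcript-cache'; // hypothetical location

// Hash the URL so the key is filesystem-safe and fixed-length
function cacheKeyFor(url) {
  return createHash('sha256').update(url).digest('hex');
}

// Return the cached transcript, or null on a cache miss
async function getCachedTranscript(url) {
  try {
    const file = path.join(CACHE_DIR, `${cacheKeyFor(url)}.json`);
    return JSON.parse(await readFile(file, 'utf8'));
  } catch {
    return null;
  }
}

async function cacheTranscript(url, transcript) {
  const file = path.join(CACHE_DIR, `${cacheKeyFor(url)}.json`);
  await writeFile(file, JSON.stringify(transcript));
}
```

Checking the cache before kicking off yt-dlp means a repeated URL skips the entire download-and-transcribe pipeline.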

Try It Yourself

If you want to see this in action, Videolyti lets you download a video and transcribe it in one go. Paste a YouTube or TikTok link, download, hit Transcribe, and get timestamped text back.

No signup, no payment. Just works.

Building something similar? I’m happy to talk through the Socket.IO progress tracking or error handling in more detail — drop a comment or reach out.

