Building AI Video Transcription with OpenAI Whisper

Published: March 1, 2026 at 02:03 PM EST
4 min read
Source: Dev.to

Why Server-Side Transcription?

Most transcription tools either charge per minute of audio or require you to upload files to a third‑party API. I wanted something that runs on my own hardware, costs nothing per request, and integrates directly with the download pipeline.

OpenAI Whisper was the obvious choice. It’s open source, handles 90+ languages, and the accuracy on the large‑v3 model is genuinely impressive — even with background noise and accented speech.

The Architecture

The stack is straightforward:

  • Express 5 backend with Socket.IO for real‑time progress updates
  • yt‑dlp handles video downloading from YouTube, TikTok, Instagram, etc.
  • ffprobe extracts audio duration metadata
  • Whisper CLI runs the actual transcription

The flow:

URL → yt-dlp download → ffprobe (duration) → Whisper CLI → JSON segments

Spawning Whisper from Node.js

Whisper runs as a Python CLI tool, so I spawn it as a child process:

```js
import { spawn } from 'child_process';

const args = [
  audioPath,
  '--output_dir', transcriptsDir,
  '--output_format', 'json',
  '--model', 'base',
  '--verbose', 'True',
];

if (language !== 'auto') {
  args.push('--language', language);
}

const whisper = spawn('/path/to/whisper', args);
```
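To actually consume the result, the spawn can be wrapped in a promise that resolves once Whisper exits and the output file is readable. This is a sketch under my own assumptions: `runWhisper`, `buildWhisperArgs`, and `WHISPER_BIN` are names I've introduced, and the output-filename logic relies on Whisper's convention of naming the JSON after the input file.

```javascript
import { spawn } from 'child_process';
import { readFile } from 'fs/promises';
import path from 'path';

const WHISPER_BIN = '/path/to/whisper'; // adjust to your install

// Pure helper: assemble the CLI arguments shown above
function buildWhisperArgs(audioPath, transcriptsDir, language) {
  const args = [
    audioPath,
    '--output_dir', transcriptsDir,
    '--output_format', 'json',
    '--model', 'base',
    '--verbose', 'True',
  ];
  if (language !== 'auto') args.push('--language', language);
  return args;
}

// Resolve with the parsed transcript once Whisper exits cleanly
function runWhisper(audioPath, transcriptsDir, language) {
  return new Promise((resolve, reject) => {
    const whisper = spawn(WHISPER_BIN, buildWhisperArgs(audioPath, transcriptsDir, language));
    whisper.on('error', reject); // binary missing, permission denied, etc.
    whisper.on('close', async (code) => {
      if (code !== 0) return reject(new Error(`whisper exited with code ${code}`));
      // Whisper names the output after the input file, with a .json extension
      const base = path.basename(audioPath, path.extname(audioPath));
      const json = await readFile(path.join(transcriptsDir, `${base}.json`), 'utf8');
      resolve(JSON.parse(json));
    });
  });
}
```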

Whisper writes a JSON file with this structure:

```json
{
  "text": "full transcript text here",
  "segments": [
    { "start": 0.0, "end": 5.04, "text": "First sentence here" },
    { "start": 5.04, "end": 10.2, "text": "Second sentence" }
  ],
  "language": "en"
}
```

Each segment includes start/end timestamps — useful for building subtitle files or jump‑to‑timestamp features.
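The segments map almost directly onto the SRT subtitle format. A minimal converter might look like this (function names are mine; the SRT framing itself — index, `HH:MM:SS,mmm --> HH:MM:SS,mmm` range, text — is the standard format):

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, w = 2) => String(n).padStart(w, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Render Whisper's segments array as the body of an .srt file
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```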

The Progress Problem

Whisper doesn’t stream progress; it processes the whole file, then writes output. For a 3‑minute video that can mean 30+ seconds of silence while the user watches a spinner.

Work‑around: estimate completion time based on audio duration and send simulated progress updates over Socket.IO.

```js
// Rough heuristic: transcription takes ~2x the audio duration, with a 10s floor
const estimatedTime = Math.max(duration * 2, 10);
let currentProgress = 0;

// Tick interval scales with the estimate so longer files advance more slowly
const progressInterval = setInterval(() => {
  if (currentProgress < 95) {
    // Advance 2–7% per tick, capped at 95% until Whisper actually finishes
    currentProgress += Math.random() * 5 + 2;
    currentProgress = Math.min(currentProgress, 95);
    emitProgress({
      jobId,
      stage: 'transcribe',
      progress: Math.round(currentProgress),
      message: `Transcribing... ${Math.round(currentProgress)}%`,
    });
  }
}, estimatedTime * 10);
```

Is this elegant? No. Does it work? Users stop hitting refresh, so yes.
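The tick logic above can be factored into a pure function, plus a cleanup step for when the real job finishes — a sketch under my own naming (`nextProgress` and `finishProgress` are not from the original code):

```javascript
// Pure version of the per-tick increment: advance 2–7%, hold at 95%
function nextProgress(current) {
  if (current >= 95) return current; // wait for Whisper to actually exit
  return Math.min(current + Math.random() * 5 + 2, 95);
}

// When the Whisper process closes, stop simulating and emit the real 100%
function finishProgress(progressInterval, emitProgress, jobId) {
  clearInterval(progressInterval);
  emitProgress({
    jobId,
    stage: 'transcribe',
    progress: 100,
    message: 'Transcription complete',
  });
}
```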

Model Selection Trade‑offs

Whisper comes in several sizes. Here’s what I found in practice:

| Model | Speed (3 min audio) | Accuracy | VRAM |
| --- | --- | --- | --- |
| tiny | ~5 sec | Decent for clear speech | ~1 GB |
| base | ~10 sec | Solid for most content | ~1 GB |
| small | ~30 sec | Noticeably better | ~2 GB |
| medium | ~90 sec | Great | ~5 GB |
| large‑v3 | ~3 min | Best available | ~10 GB |

I default to base for the free tier: fast enough that users don’t abandon the page, accurate enough for most use cases. The tiny model occasionally garbles words, especially in non‑English content.

For production I’d recommend base as default, with an option to bump up to small or medium when accuracy matters more than speed.

Handling Failures

Whisper can fail in ways that aren’t immediately obvious:

  • Corrupted audio – yt-dlp sometimes produces files that ffmpeg can decode but Whisper chokes on. I added a pre‑check using ffprobe to validate the audio stream before sending it to Whisper.
  • Memory limits – The large‑v3 model needs ~10 GB VRAM. On a machine with 8 GB it silently falls back to CPU and takes ~10× longer. Set explicit timeouts or users will wait forever.
  • Language detection hiccups – Whisper’s auto‑detect usually works, but it can confuse similar languages (Ukrainian vs Russian, Spanish vs Portuguese). Letting users pick the language themselves fixes this.
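The ffprobe pre‑check mentioned above might look like this — a sketch, with `checkAudio` and `hasAudioStream` as names I've chosen; the ffprobe flags (`-show_streams`, `-print_format json`) are standard:

```javascript
import { spawn } from 'child_process';

// Pure helper: does ffprobe's JSON report at least one audio stream?
function hasAudioStream(ffprobeJson) {
  const info = JSON.parse(ffprobeJson);
  return Array.isArray(info.streams) &&
    info.streams.some((s) => s.codec_type === 'audio');
}

// Reject corrupt files before they ever reach Whisper
function checkAudio(audioPath) {
  return new Promise((resolve, reject) => {
    const ffprobe = spawn('ffprobe', [
      '-v', 'error',
      '-show_streams',
      '-print_format', 'json',
      audioPath,
    ]);
    let out = '';
    ffprobe.stdout.on('data', (chunk) => { out += chunk; });
    ffprobe.on('error', reject);
    ffprobe.on('close', (code) => {
      if (code !== 0) return reject(new Error('ffprobe failed: file is likely corrupt'));
      resolve(hasAudioStream(out));
    });
  });
}
```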

My timeout sits at 30 minutes – generous, but some long videos with the medium model genuinely need that much time on CPU.

What I’d Change Now

  • Use faster‑whisper. The CTranslate2‑based implementation is ~4× faster with the same accuracy. I started with the vanilla OpenAI CLI for simplicity but would switch for any serious deployment.
  • Add a job queue. Right now transcription runs inline. Two concurrent jobs on the same GPU will both slow to a crawl. Bull or BullMQ with proper concurrency limits would fix this.
  • Cache transcripts. The same video URL should return cached results instead of re‑processing. Storage is cheap; CPU time isn’t.
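The caching idea could be as simple as keying stored transcripts by a hash of the URL — a sketch under my own assumptions (`cacheKeyFor`, the cache directory layout, and the JSON-file storage are all mine):

```javascript
import { createHash } from 'crypto';
import { readFile, writeFile } from 'fs/promises';
import path from 'path';

const CACHE_DIR = './transcript-cache'; // hypothetical location

// Hash the URL so the key is filesystem-safe and fixed-length
function cacheKeyFor(url) {
  return createHash('sha256').update(url).digest('hex');
}

// Return the cached transcript, or null on a cache miss
async function getCachedTranscript(url) {
  try {
    const file = path.join(CACHE_DIR, `${cacheKeyFor(url)}.json`);
    return JSON.parse(await readFile(file, 'utf8'));
  } catch {
    return null;
  }
}

async function cacheTranscript(url, transcript) {
  const file = path.join(CACHE_DIR, `${cacheKeyFor(url)}.json`);
  await writeFile(file, JSON.stringify(transcript));
}
```

Checking the cache before kicking off yt-dlp means a repeated URL skips the entire download-and-transcribe pipeline.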

Try It Yourself

If you want to see this in action, Videolyti lets you download a video and transcribe it in one go. Paste a YouTube or TikTok link, download, hit Transcribe, and get timestamped text back.

No signup, no payment. Just works.

Building something similar? I’m happy to talk through the Socket.IO progress tracking or error handling in more detail — drop a comment or reach out.

