From Prototype to Production: Building a Multimodal Video Search Engine
Source: Dev.to
Overview
In the previous post I explored the power of model stacking for media search by combining CLIP, Whisper, and ArcFace to locate video content through visual descriptions, dialog, and faces. Over the holidays I turned that afternoon hack into a more production‑ready system.
Live Demo
- Demo site: (desktop browser)
- Starter code:
Example workflow
- In the Visual Content tab, type “older man on phone, harbor background” → click +.
- Click the face of the older guy with glasses sitting against the harbor.
- In the Dialog (Semantic mode) tab, type “Americans had launched their missiles” → click +.
- Play the resulting clip.
You’ve drilled down to an exact shot without relying on metadata, timecodes, or exact wording. The semantic search is fuzzy: the transcript says “What it was telling him was that the US had launched their ICBMs,” and the query still matches.
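To make the fuzzy matching concrete, here is a minimal Python sketch of how an embedding search can rank transcript segments by cosine similarity. The three-dimensional vectors are hand-made toy values standing in for real sentence embeddings, and the function names and the placeholder second segment are hypothetical. This is an illustration, not the project's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def rank_segments(query_vec, segments):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in segments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors chosen by hand, NOT real model output.
TOY_SEGMENTS = [
    ("What it was telling him was that the US had launched their ICBMs", [0.9, 0.1, 0.0]),
    ("(some unrelated line of dialog)", [0.0, 0.2, 0.9]),
]
# Stands in for the embedded query "Americans had launched their missiles".
TOY_QUERY = [0.8, 0.2, 0.1]

ranked = rank_segments(TOY_QUERY, TOY_SEGMENTS)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

Because the score depends on the direction of the vectors rather than on shared words, a paraphrase can outrank a line that merely repeats a keyword.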
Architecture
- Frontend: Vue.js served by Nginx
- Backend: FastAPI
- Ingest worker: Standalone process that polls for new media and handles drive mounting/unmounting gracefully. It polls rather than watching for filesystem events because Watchdog is unreliable with NFS/network shares.
- Database: PostgreSQL with the pgvector extension for vector similarity search
All components are orchestrated with docker‑compose.
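The polling approach used by the ingest worker can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual worker: the extension set, the function names, and the `handle_file` callback are all assumptions.

```python
import os
import time
from pathlib import Path

# Assumed set of media extensions; the real worker may accept others.
MEDIA_EXTENSIONS = {".mp4", ".mov", ".mkv"}


def scan_media(root):
    """Return {path: (size, mtime)} for media files under root.

    Returns None when root is not an accessible directory, which is how
    an unmounted drive or share typically shows up.
    """
    if not os.path.isdir(root):
        return None
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.suffix.lower() not in MEDIA_EXTENSIONS:
                continue
            try:
                stat = path.stat()
            except OSError:
                continue  # file vanished or share dropped mid-scan
            snapshot[str(path)] = (stat.st_size, stat.st_mtime)
    return snapshot


def new_or_changed(previous, current):
    """Paths that are new, or whose size/mtime differ from the last scan."""
    return sorted(p for p, sig in current.items() if previous.get(p) != sig)


def poll_forever(root, handle_file, interval_seconds=10.0):
    """Call handle_file(path) for each new or changed media file."""
    seen = {}
    while True:
        current = scan_media(root)
        if current is not None:
            for path in new_or_changed(seen, current):
                handle_file(path)
            seen = current
        # When the drive is unmounted, keep the old snapshot and retry later.
        time.sleep(interval_seconds)
```

A call such as `poll_forever("/mnt/media", print)` would print each new file as it appears (the mount path is a placeholder). A real worker would probably also wait until a file's size stops changing before processing it, so that partially copied files are skipped.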
Features
- Background enrichment – Worker continuously processes new files and extracts visual, audio, and facial embeddings.
- Semantic dialog search – Uses sentence‑transformer embeddings; queries like “Americans launched missiles” retrieve clips containing “US fired rockets.”
- Frame‑accurate playback – HTML5 video decoded to a canvas via requestVideoFrameCallback().
- EDL export – Queue selected scenes and export a CMX 3600 edit decision list for NLE round‑tripping.
- Unified query – PostgreSQL + pgvector enables vector similarity combined with metadata filtering in a single query.
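As an illustration of the unified query idea, the sketch below builds one SQL statement that orders rows by pgvector's cosine distance operator (`<=>`) and applies an ordinary metadata filter in the same query. The `scenes` table, its columns, and the filename filter are hypothetical, and the `%s` placeholders assume a psycopg-style driver. The real schema may look quite different.

```python
def to_pgvector_literal(embedding):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_scene_query(embedding, filename_like=None, limit=10):
    """Return (sql, params) for a similarity search with an optional filter.

    Table and column names are illustrative assumptions.
    """
    lines = [
        "SELECT id, filename, start_seconds, end_seconds,",
        "       embedding <=> %s::vector AS distance",  # cosine distance
        "FROM scenes",
    ]
    params = [to_pgvector_literal(embedding)]
    if filename_like is not None:
        # Plain relational filter evaluated alongside the vector ordering.
        lines.append("WHERE filename ILIKE %s")
        params.append(filename_like)
    lines.append("ORDER BY distance")
    lines.append("LIMIT %s")
    params.append(limit)
    return "\n".join(lines), params


sql, params = build_scene_query([0.1, 0.2], filename_like="%pioneer%", limit=5)
print(sql)
print(params)
```

With a psycopg connection, the pair could be passed straight to `cursor.execute(sql, params)`. Keeping values in `params` rather than formatting them into the string avoids SQL injection from user-typed filters.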
Code
The full source code and Docker configuration are available at:
Acknowledgements
- Demo footage is from Pioneer One, a Creative Commons‑licensed Canadian drama.
- Significant assistance was provided by Claude Code.