Show HN: Multimodal perception system for real-time conversation
Source: Hacker News
Overview
I work on real‑time voice/video AI at Tavus, focusing on how machines respond in a conversation.
Most conversational systems reduce everything to transcripts, discarding many useful visual and audio signals. Existing emotion‑understanding models often classify into small, arbitrary label sets and lack the speed or richness needed for real‑time conversation.
To address this, I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on these signals. The agent can “see” and “hear” you, and you can interact with it via an OpenAI‑compatible tool schema in a live conversation.
The system outputs short natural‑language descriptions of what’s happening in the interaction—e.g., uncertainty building, sarcasm, disengagement, or a shift in attention within a single turn.
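The post says the agent is reachable through an OpenAI‑compatible tool schema, but does not publish the actual schema. As a minimal sketch, a perception readout exposed as such a tool might look like the following; the tool name, parameters, and example output are assumptions for illustration, not Tavus's real API:

```python
# Hypothetical OpenAI-compatible tool (function) definition for a
# perception readout. All names and fields here are illustrative
# assumptions; the post does not specify the real schema.
perception_tool = {
    "type": "function",
    "function": {
        "name": "get_perception_state",
        "description": (
            "Return a short natural-language description of the user's "
            "current visual and audio conversational signals."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "window_seconds": {
                    "type": "number",
                    "description": "How far back to summarize, in seconds.",
                },
            },
            "required": [],
        },
    },
}

# A call to such a tool might return a short free-text description like:
# "User leans back and gaze drifts off-screen; tone flattens mid-turn,
#  suggesting disengagement."
```

The appeal of this shape is that any client already speaking the OpenAI tools format could request perception state without a custom integration.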
Specs
- Real‑time operation per conversation
- ~15 fps video processing with overlapping audio
- Handles nuanced emotions, from whispers to shouts
- Trained on synthetic and internal conversation data
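At ~15 fps, the per‑frame budget is roughly 66 ms. One way to read "overlapping audio" is that each video frame is paired with an audio window longer than the frame interval, so prosody spanning frames is preserved. A sketch under that assumption (the 500 ms window length is illustrative, not a published spec):

```python
# Sketch: align overlapping audio windows to a ~15 fps video stream.
# The window length is an illustrative assumption, not a published spec.
FPS = 15
FRAME_DT = 1.0 / FPS   # ~66.7 ms between frames
AUDIO_WIN = 0.5        # 500 ms audio window per frame (assumed)

def audio_window(frame_idx: int) -> tuple[float, float]:
    """Return (start, end) seconds of the audio window ending at this frame."""
    t = frame_idx * FRAME_DT
    return (max(0.0, t - AUDIO_WIN), t)

# Because AUDIO_WIN >> FRAME_DT, consecutive windows overlap by ~433 ms,
# so speech events near a frame boundary appear in several windows.
w0 = audio_window(30)  # frame at t = 2.0 s
w1 = audio_window(31)  # next frame, ~66.7 ms later
```

The design choice being illustrated: video is sampled at discrete frames, but audio cues like a whisper or a rising tone unfold over hundreds of milliseconds, so the audio context must slide and overlap rather than being chopped at frame boundaries.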
Further Reading
More details are available in the original post:
https://www.tavus.io/post/raven-1-bringing-emotional-intelli…
Discussion
Comments can be found at:
https://news.ycombinator.com/item?id=46965012 (8 points, 1 comment)