Show HN: Multimodal perception system for real-time conversation

Published: February 10, 2026 at 01:58 PM EST

Source: Hacker News

Overview

I work on real‑time voice/video AI at Tavus, focusing on how machines respond in a conversation.
Most conversational systems reduce everything to transcripts, discarding many useful visual and audio signals. Existing emotion‑understanding models often classify into small, arbitrary label sets and lack the speed or richness needed for real‑time conversation.

To address this, I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on these signals. The agent can “see” and “hear” you, and you can interact with it via an OpenAI‑compatible tool schema in a live conversation.
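The post does not publish the tool schema itself, so as a minimal sketch only: an OpenAI‑compatible tool definition for querying the perception stream might look like the following, with every name (`get_perception_state`, `window_seconds`) a hypothetical placeholder.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
# The actual schema exposed by the system is not shown in the post.
perception_tool = {
    "type": "function",
    "function": {
        "name": "get_perception_state",  # hypothetical name
        "description": (
            "Return the latest natural-language description "
            "of the user's visual and vocal state."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "window_seconds": {
                    "type": "number",
                    "description": "How far back to summarize, in seconds.",
                },
            },
            "required": [],
        },
    },
}

print(json.dumps(perception_tool, indent=2))
```

An LLM runtime that supports function calling could pass this entry in its `tools` list and invoke it each turn to fold the perceived state into the agent's response.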

The system outputs short natural‑language descriptions of what’s happening in the interaction—e.g., uncertainty building, sarcasm, disengagement, or a shift in attention within a single turn.
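For illustration only (the post gives no concrete payloads), a mid‑turn output stream of this kind could be modeled as timestamped description events; the values below are invented, with wording drawn from the examples in the post.

```python
from dataclasses import dataclass

@dataclass
class PerceptionEvent:
    """Hypothetical container for one mid-turn perception update."""
    t_ms: int         # time offset within the turn, in milliseconds
    description: str  # short natural-language summary of the signal

# Invented example events of the kind the post describes.
events = [
    PerceptionEvent(1200, "voice drops to a whisper; uncertainty building"),
    PerceptionEvent(3400, "gaze shifts off-screen; attention drifting"),
]

for e in events:
    print(f"{e.t_ms:>5} ms  {e.description}")
```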

Specs

  • Real‑time operation per conversation
  • ~15 fps video processing with overlapping audio
  • Handles nuanced emotions, from whispers to shouts
  • Trained on synthetic and internal conversation data

Further Reading

More details are available in the original post:
https://www.tavus.io/post/raven-1-bringing-emotional-intelli…

Discussion

Comments can be found at:
https://news.ycombinator.com/item?id=46965012 (8 points, 1 comment)
