Show HN: Multimodal perception system for real-time conversation
Source: Hacker News
Overview
I work on real‑time voice/video AI at Tavus, focusing on how machines respond in a conversation.
Most conversational systems reduce everything to transcripts, discarding many useful visual and audio signals. Existing emotion‑understanding models often classify into small, arbitrary label sets and lack the speed or richness needed for real‑time conversation.
To address this, I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on these signals. The agent can “see” and “hear” you, and you can interact with it via an OpenAI‑compatible tool schema in a live conversation.
The system outputs short natural‑language descriptions of what’s happening in the interaction—e.g., uncertainty building, sarcasm, disengagement, or a shift in attention within a single turn.
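The post says the agent is reachable through an OpenAI‑compatible tool schema, but does not publish the actual schema. As a minimal sketch, a perception readout exposed as such a tool might look like the following; the tool name, parameters, and example output are assumptions for illustration, not Tavus's real API:

```python
# Hypothetical OpenAI-compatible tool (function) definition for a
# perception readout. All names and fields here are illustrative
# assumptions; the post does not specify the real schema.
perception_tool = {
    "type": "function",
    "function": {
        "name": "get_perception_state",
        "description": (
            "Return a short natural-language description of the user's "
            "current visual and audio conversational signals."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "window_seconds": {
                    "type": "number",
                    "description": "How far back to summarize, in seconds.",
                },
            },
            "required": [],
        },
    },
}

# A call to such a tool might return a short free-text description like:
# "User leans back and gaze drifts off-screen; tone flattens mid-turn,
#  suggesting disengagement."
```

The appeal of this shape is that any client already speaking the OpenAI tools format could request perception state without a custom integration.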
Specs
- Real‑time operation per conversation
- ~15 fps video processing with overlapping audio
- Handles nuanced emotions, from whispers to shouts
- Trained on synthetic and internal conversation data
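At ~15 fps, the per‑frame budget is roughly 66 ms. One way to read "overlapping audio" is that each video frame is paired with an audio window longer than the frame interval, so prosody spanning frames is preserved. A sketch under that assumption (the 500 ms window length is illustrative, not a published spec):

```python
# Sketch: align overlapping audio windows to a ~15 fps video stream.
# The window length is an illustrative assumption, not a published spec.
FPS = 15
FRAME_DT = 1.0 / FPS   # ~66.7 ms between frames
AUDIO_WIN = 0.5        # 500 ms audio window per frame (assumed)

def audio_window(frame_idx: int) -> tuple[float, float]:
    """Return (start, end) seconds of the audio window ending at this frame."""
    t = frame_idx * FRAME_DT
    return (max(0.0, t - AUDIO_WIN), t)

# Because AUDIO_WIN >> FRAME_DT, consecutive windows overlap by ~433 ms,
# so speech events near a frame boundary appear in several windows.
w0 = audio_window(30)  # frame at t = 2.0 s
w1 = audio_window(31)  # next frame, ~66.7 ms later
```

The design choice being illustrated: video is sampled at discrete frames, but audio cues like a whisper or a rising tone unfold over hundreds of milliseconds, so the audio context must slide and overlap rather than being chopped at frame boundaries.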
Further Reading
More details are available in the original post:
https://www.tavus.io/post/raven-1-bringing-emotional-intelli…
Discussion
Comments can be found at:
https://news.ycombinator.com/item?id=46965012 (8 points, 1 comment)