[Paper] CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences
Source: arXiv - 2512.10918v1
Overview
The paper introduces CompanionCast, a modular framework that brings together multiple AI “companions” to watch videos with you—complete with spoken dialogue, personality quirks, and spatial audio that makes each voice sound like it’s coming from a specific spot on the screen. By letting these agents react to the video in real time, the system aims to recreate the feeling of watching a game or a show with friends, even when you’re alone.
Key Contributions
- Multi‑agent orchestration layer that synchronizes role‑specialized LLMs (e.g., commentator, fan, analyst) with video streams and audio output.
- LLM‑as‑a‑Judge evaluation module that scores ongoing conversations on relevance, authenticity, engagement, diversity, and personality consistency, and feeds the scores back to improve the agents’ responses (a rough sketch of this scoring rubric follows the list).
- Spatial‑audio rendering pipeline that places each agent’s synthesized voice in a 3‑D sound field, enhancing the sense of co‑presence.
- Pilot user study with soccer fans showing that multi‑agent co‑viewing boosts perceived social presence compared with solo viewing.
- Generalizable design that can be swapped into other domains (education, entertainment, collaborative work) with minimal re‑engineering.
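As a rough illustration of the judge module above, the sketch below encodes the paper's five scoring dimensions and builds a judging prompt around them. Only the dimension names come from the paper; the `JudgeScore` dataclass, the acceptance threshold, and the prompt wording are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, fields

@dataclass
class JudgeScore:
    """The five quality dimensions named in the paper; the structure itself is assumed."""
    relevance: float                 # does the utterance track the current video moment?
    authenticity: float              # does it sound like a real co-viewer?
    engagement: float                # does it invite the conversation to continue?
    diversity: float                 # does it avoid repeating earlier lines?
    personality_consistency: float   # does it stay in the agent's assigned persona?

    def passes(self, threshold: float = 0.6) -> bool:
        """Accept an utterance only if every dimension clears the (assumed) threshold."""
        return all(getattr(self, f.name) >= threshold for f in fields(self))


def judge_prompt(utterance: str, context: str) -> str:
    """Build a prompt asking a judge LLM to rate one utterance from 0.0 to 1.0 per dimension."""
    dims = ", ".join(f.name for f in fields(JudgeScore))
    return (
        f"Video context:\n{context}\n\n"
        f"Candidate utterance:\n{utterance}\n\n"
        f"Score the utterance from 0.0 to 1.0 on each of: {dims}. "
        "Reply as JSON with exactly those keys."
    )
```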
Methodology
- Video Ingestion – The system extracts visual and audio cues (e.g., scene changes, crowd noise, on‑screen text) from a live or pre‑recorded video feed.
- Agent Roles – Different LLM instances are assigned distinct personas (e.g., “enthusiastic fan”, “tactical analyst”, “casual commentator”). Each receives the same multimodal context but is prompted to respond according to its role.
- Conversation Loop – Agents generate short utterances, which are passed to a Judge LLM. The Judge scores each utterance on five quality dimensions and can request revisions or re‑ranking (see the sketch after this list).
- Speech Synthesis + Spatial Audio – Approved utterances are turned into speech via a TTS engine, then positioned in a virtual sound space (e.g., left‑speaker for the fan, right‑speaker for the analyst) using binaural rendering.
- User Interaction – Viewers can optionally speak or type to the agents, allowing the system to adapt its dialogue in real time.
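The loop below sketches how the Agent Roles and Conversation Loop steps could fit together: every persona receives the same cue, proposes a short utterance, and a judge model gates it before it moves on to speech synthesis. The persona prompts, the `call_llm` and `judge` placeholders, and the revision logic are assumptions about one plausible wiring, not the released system.

```python
from dataclasses import dataclass

@dataclass
class VideoCue:
    timestamp: float
    description: str   # e.g. "goal scored, crowd roars, scoreboard changes to 2-1"

# Role-specialized personas: the same multimodal context, different system prompts.
PERSONAS = {
    "enthusiastic_fan": "You are an excitable soccer fan. React in one short, emotional sentence.",
    "tactical_analyst": "You are a calm tactical analyst. Explain the play in one concise sentence.",
    "casual_commentator": "You are a casual commentator. Keep the conversation flowing in one sentence.",
}

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for any chat-completion endpoint (GPT-4 or a smaller open-source model)."""
    raise NotImplementedError

def judge(utterance: str, context: str) -> float:
    """Placeholder returning an overall quality score in [0, 1] from the judge LLM."""
    raise NotImplementedError

def conversation_step(cue: VideoCue, max_revisions: int = 2) -> dict[str, str]:
    """Generate one utterance per persona, asking for a revision whenever the judge rejects it."""
    approved = {}
    for role, system_prompt in PERSONAS.items():
        prompt = f"[{cue.timestamp:.1f}s] {cue.description}\nRespond in character."
        utterance = call_llm(system_prompt, prompt)
        for _ in range(max_revisions):
            if judge(utterance, cue.description) >= 0.6:
                break
            utterance = call_llm(system_prompt, prompt + "\nYour last reply was rejected; try a fresh take.")
        approved[role] = utterance
    return approved
```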
All components communicate through a lightweight message bus, making it easy for developers to replace any sub‑module (e.g., swap GPT‑4 for a smaller open‑source model).
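A minimal sketch of that publish/subscribe pattern, assuming an in-process bus rather than whatever transport the system actually uses; the topic names are invented, and the constant-power stereo pan is only a stand-in for the binaural rendering described in the Speech Synthesis + Spatial Audio step.

```python
import math
from collections import defaultdict
from typing import Callable

class MessageBus:
    """Tiny in-process pub/sub bus; any sub-module can be swapped by registering a different handler."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)

# Illustrative placements: fan on the left, analyst on the right, commentator centered.
AGENT_AZIMUTH = {"enthusiastic_fan": -45.0, "tactical_analyst": 45.0, "casual_commentator": 0.0}

def spatial_audio_renderer(payload: dict) -> None:
    """Stand-in for the binaural renderer: a constant-power stereo pan from the agent's azimuth."""
    azimuth = AGENT_AZIMUTH.get(payload["role"], 0.0)
    pan = (azimuth + 45.0) / 90.0            # map [-45, 45] degrees to [0, 1]
    left, right = math.cos(pan * math.pi / 2), math.sin(pan * math.pi / 2)
    print(f'{payload["role"]}: "{payload["text"]}" (gains L={left:.2f}, R={right:.2f})')

bus = MessageBus()
bus.subscribe("utterance.approved", spatial_audio_renderer)
bus.publish("utterance.approved", {"role": "enthusiastic_fan", "text": "What a strike!"})
```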
Results & Findings
- Social Presence Score – In a controlled experiment with 30 soccer fans, participants rated the CompanionCast experience 23 % higher on a standard social presence questionnaire than a baseline solo‑watch condition.
- Engagement Metrics – Average interaction time (clicks, typed messages) increased by 18 % when multiple agents were present, indicating that users stayed more involved.
- Judge Effectiveness – The LLM‑as‑a‑Judge reduced off‑topic or repetitive utterances by 42 % compared to a naïve generation pipeline, leading to smoother conversations.
- Audio Realism – Subjective listening tests showed that spatial audio contributed a 15 % uplift in perceived “being in the room with others,” confirming the value of 3‑D sound placement.
Practical Implications
- Streaming Platforms – Services like Netflix, Twitch, or sports broadcasters could embed CompanionCast agents to offer “virtual watch‑party” experiences without needing real friends to be online.
- Remote Collaboration – Teams reviewing training videos, design mock‑ups, or code walkthroughs could benefit from role‑specific AI assistants that comment, ask questions, and keep the discussion lively.
- Education – Teachers could deploy a panel of AI “students” that ask clarifying questions or provide alternative explanations while a lecture video plays, making remote learning feel more interactive.
- Developer Toolkit – Because the framework is built on standard APIs (LLM endpoints, WebRTC video streams, binaural audio libraries), developers can prototype new agent personas or integrate domain‑specific knowledge bases with a few lines of code (an illustrative persona definition follows this list).
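As a hedged example of what that might look like in practice, the snippet below defines a new persona for a code-walkthrough scenario. The `Persona` structure, field names, and file paths are hypothetical stand-ins for whatever configuration a real integration would expose.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    name: str
    system_prompt: str
    voice: str                       # TTS voice identifier (assumed naming scheme)
    azimuth_degrees: float           # where the voice sits in the spatial sound field
    knowledge_base: list[str] = field(default_factory=list)  # optional domain documents

# A hypothetical persona for reviewing a recorded code walkthrough.
senior_reviewer = Persona(
    name="senior_reviewer",
    system_prompt=(
        "You are a senior engineer watching a code walkthrough. "
        "Ask one pointed question or flag one risk per video moment, in a friendly tone."
    ),
    voice="en-US-neutral-1",
    azimuth_degrees=30.0,
    knowledge_base=["team_style_guide.md", "architecture_overview.md"],
)
```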
Limitations & Future Work
- Domain Specificity – The pilot focused on soccer; performance in narrative movies or news broadcasts remains untested.
- Latency – Real‑time synchronization of video cues, LLM inference, and audio rendering can introduce noticeable delays on low‑end hardware.
- Judge Overhead – Running an extra LLM for quality control doubles inference cost, which may be prohibitive for large‑scale deployments.
- User Personalization – Current agents follow static personas; future work will explore dynamic personality adaptation based on user preferences and interaction history.
Overall, CompanionCast opens a promising path toward AI‑driven co‑viewing experiences that feel socially rich, while also highlighting the engineering challenges that need to be tackled before it becomes a mainstream feature.
Authors
- Yiyang Wang
- Chen Chen
- Tica Lin
- Vishnu Raj
- Josh Kimball
- Alex Cabral
- Josiah Hester
Paper Information
- arXiv ID: 2512.10918v1
- Categories: cs.HC, cs.CL
- Published: December 11, 2025