[Paper] CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences

Published: December 11, 2025 at 01:44 PM EST
4 min read
Source: arXiv - 2512.10918v1

Overview

The paper introduces CompanionCast, a modular framework that brings together multiple AI “companions” to watch videos with you—complete with spoken dialogue, personality quirks, and spatial audio that makes each voice sound like it’s coming from a specific spot on the screen. By letting these agents react to the video in real time, the system aims to recreate the feeling of watching a game or a show with friends, even when you’re alone.

Key Contributions

  • Multi‑agent orchestration layer that synchronizes role‑specialized LLMs (e.g., commentator, fan, analyst) with video streams and audio output.
  • LLM‑as‑a‑Judge evaluation module that scores ongoing conversations on relevance, authenticity, engagement, diversity, and personality consistency, and feeds the scores back to improve the agents’ responses.
  • Spatial‑audio rendering pipeline that places each agent’s synthesized voice in a 3‑D sound field, enhancing the sense of co‑presence (a simplified panning sketch appears after this list).
  • Pilot user study with soccer fans showing that multi‑agent co‑viewing boosts perceived social presence compared with solo viewing.
  • Generalizable design that can be swapped into other domains (education, entertainment, collaborative work) with minimal re‑engineering.
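
The summary describes the rendering pipeline as binaural but does not reproduce implementation details. As a rough illustration of the placement idea only, here is a minimal constant‑power stereo panning sketch in Python; the azimuth parameter, the mono TTS input, and the agent placements are assumptions, and a real binaural renderer would use HRTFs rather than simple panning.

```python
import numpy as np

def pan_voice(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Place a mono TTS signal in a stereo field with constant-power panning.

    azimuth_deg: -90 (hard left) .. +90 (hard right). This is a toy stand-in
    for the binaural (HRTF-based) rendering the paper describes.
    """
    # Map azimuth to a pan angle in [0, pi/2]
    theta = (azimuth_deg + 90.0) / 180.0 * (np.pi / 2.0)
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=-1)

# Example (illustrative): the "fan" agent slightly left, the "analyst" slightly right.
fan_voice = pan_voice(np.random.randn(48000).astype(np.float32), azimuth_deg=-40)
analyst_voice = pan_voice(np.random.randn(48000).astype(np.float32), azimuth_deg=+40)
mix = fan_voice + analyst_voice  # sum the agents into one stereo stream
```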

Methodology

  1. Video Ingestion – The system extracts visual and audio cues (e.g., scene changes, crowd noise, on‑screen text) from a live or pre‑recorded video feed.
  2. Agent Roles – Different LLM instances are assigned distinct personas (e.g., “enthusiastic fan”, “tactical analyst”, “casual commentator”). Each receives the same multimodal context but is prompted to respond according to its role.
  3. Conversation Loop – Agents generate short utterances, which are passed to a Judge LLM. The Judge scores each utterance on five quality dimensions and can request revisions or re‑ranking (a sketch of this loop appears after the list).
  4. Speech Synthesis + Spatial Audio – Approved utterances are turned into speech via a TTS engine, then positioned in a virtual sound space (e.g., left‑speaker for the fan, right‑speaker for the analyst) using binaural rendering.
  5. User Interaction – Viewers can optionally speak or type to the agents, allowing the system to adapt its dialogue in real time.
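
The paper’s prompts and judge rubric are not reproduced in this summary, so the following is only a sketch of the generate‑then‑judge loop. The `complete()` function is a hypothetical stand‑in for any chat‑completion endpoint, the single‑retry policy is invented for illustration, and the five scoring dimensions are the ones listed above.

```python
import json
from dataclasses import dataclass

DIMENSIONS = ["relevance", "authenticity", "engagement", "diversity", "personality_consistency"]

@dataclass
class Agent:
    name: str
    persona: str  # e.g. "enthusiastic fan", "tactical analyst"

def complete(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat-completion API."""
    raise NotImplementedError

def generate_utterance(agent: Agent, video_context: str, history: list[str]) -> str:
    prompt = (
        f"You are {agent.name}, a {agent.persona} watching a match.\n"
        f"Recent video cues: {video_context}\n"
        f"Conversation so far: {history[-5:]}\n"
        "Reply with one short, in-character utterance."
    )
    return complete(prompt)

def judge(utterance: str, video_context: str, history: list[str]) -> dict[str, float]:
    prompt = (
        f"Score the candidate utterance from 1-5 on each dimension {DIMENSIONS}.\n"
        f"Context: {video_context}\nHistory: {history[-5:]}\n"
        f"Candidate: {utterance}\n"
        "Answer as a JSON object mapping dimension to score."
    )
    return json.loads(complete(prompt))

def step(agents: list[Agent], video_context: str, history: list[str], threshold: float = 3.5) -> list[str]:
    """One loop iteration: each agent speaks; low-scoring lines get one retry."""
    for agent in agents:
        utterance = generate_utterance(agent, video_context, history)
        scores = judge(utterance, video_context, history)
        if min(scores.values()) < threshold:
            utterance = generate_utterance(agent, video_context, history)  # simple revision
        history.append(f"{agent.name}: {utterance}")
    return history
```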

All components communicate through a lightweight message bus, making it easy for developers to replace any sub‑module (e.g., swap GPT‑4 for a smaller open‑source model).
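
The summary does not say which message bus the authors use; the snippet below is a minimal in‑process publish/subscribe sketch (topic names and payload shape are illustrative) showing why swapping a sub‑module such as the TTS engine does not require touching the producers upstream of it.

```python
from collections import defaultdict
from typing import Any, Callable

class MessageBus:
    """Minimal in-process pub/sub; topic names below are illustrative."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)

bus = MessageBus()

# Swapping the TTS engine only means registering a different handler
# for the "utterance.approved" topic; publishers stay unchanged.
def local_tts(utterance: dict) -> None:
    print(f"[local TTS] synthesizing: {utterance['text']}")

bus.subscribe("utterance.approved", local_tts)
bus.publish("utterance.approved", {"agent": "fan", "text": "What a save!"})
```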

Results & Findings

  • Social Presence Score – In a controlled experiment with 30 soccer fans, participants rated the CompanionCast experience 23% higher on a standard social presence questionnaire than a baseline solo‑watch condition.
  • Engagement Metrics – Average interaction time (clicks, typed messages) increased by 18% when multiple agents were present, indicating that users stayed more involved.
  • Judge Effectiveness – The LLM‑as‑a‑Judge reduced off‑topic or repetitive utterances by 42% compared to a naïve generation pipeline, leading to smoother conversations.
  • Audio Realism – Subjective listening tests showed that spatial audio contributed a 15% uplift in perceived “being in the room with others,” confirming the value of 3‑D sound placement.

Practical Implications

  • Streaming Platforms – Services like Netflix, Twitch, or sports broadcasters could embed CompanionCast agents to offer “virtual watch‑party” experiences without needing real friends to be online.
  • Remote Collaboration – Teams reviewing training videos, design mock‑ups, or code walkthroughs could benefit from role‑specific AI assistants that comment, ask questions, and keep the discussion lively.
  • Education – Teachers could deploy a panel of AI “students” that ask clarifying questions or provide alternative explanations while a lecture video plays, making remote learning feel more interactive.
  • Developer Toolkit – Because the framework is built on standard APIs (LLM endpoints, WebRTC video streams, binaural audio libraries), developers can prototype new agent personas or integrate domain‑specific knowledge bases with a few lines of code; an illustrative persona definition follows below.
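
CompanionCast’s actual persona/configuration API is not shown in this summary; the dataclass below is a hypothetical illustration of how little code a new persona might take, with invented field names.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """Hypothetical persona config; field names are illustrative, not the paper's API."""
    name: str
    role: str                      # e.g. "tactical analyst"
    voice_id: str                  # TTS voice to synthesize with
    azimuth_deg: float             # where the voice sits in the sound field
    speaking_style: str = "concise"
    knowledge_tags: list[str] = field(default_factory=list)

# A domain swap (soccer -> lecture review) is mostly a matter of new personas.
curious_student = Persona(
    name="Sam",
    role="curious student who asks clarifying questions",
    voice_id="en-US-voice-2",
    azimuth_deg=30.0,
    knowledge_tags=["linear algebra", "lecture 4"],
)
```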

Limitations & Future Work

  • Domain Specificity – The pilot focused on soccer; performance in narrative movies or news broadcasts remains untested.
  • Latency – Real‑time synchronization of video cues, LLM inference, and audio rendering can introduce noticeable delays on low‑end hardware.
  • Judge Overhead – Running an extra LLM for quality control doubles inference cost, which may be prohibitive for large‑scale deployments.
  • User Personalization – Current agents follow static personas; future work will explore dynamic personality adaptation based on user preferences and interaction history.

Overall, CompanionCast opens a promising path toward AI‑driven co‑viewing experiences that feel socially rich, while also highlighting the engineering challenges that need to be tackled before it becomes a mainstream feature.

Authors

  • Yiyang Wang
  • Chen Chen
  • Tica Lin
  • Vishnu Raj
  • Josh Kimball
  • Alex Cabral
  • Josiah Hester

Paper Information

  • arXiv ID: 2512.10918v1
  • Categories: cs.HC, cs.CL
  • Published: December 11, 2025