[Paper] SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Published: March 17, 2026 at 01:58 PM EDT
5 min read
Source: arXiv - 2603.16859v1

Overview

The paper introduces SocialOmni, the first benchmark that measures how well omni‑modal large language models (OLMs) can interact socially in real‑time audio‑visual conversations. Instead of focusing only on static perception or pure text generation, SocialOmni evaluates whether a model can recognize who is speaking, decide the right moment to jump in, and craft a natural interruption—skills that are essential for truly conversational AI assistants, virtual meeting hosts, and interactive agents.

Key Contributions

  • A three‑dimensional interaction benchmark covering (i) speaker identification, (ii) timing of interruptions, and (iii) phrasing of interruptions.
  • 2,000 perception samples + 209 tightly controlled interaction‑generation instances with explicit temporal and contextual constraints.
  • Audio‑visual inconsistency probes that deliberately mis‑align sound and video to test model robustness to noisy real‑world inputs.
  • Comprehensive evaluation of 12 state‑of‑the‑art OLMs, revealing large gaps between perception accuracy and interactive competence.
  • Diagnostic insights showing that high perceptual scores do not guarantee socially appropriate interruptions, highlighting a new “perception‑interaction” divide.
  • Actionable signals for future model design, suggesting how to close the gap between understanding and interactive behavior.

Methodology

  1. Dataset Construction

    • Collected multi‑person video clips (e.g., meetings, podcasts) with synchronized audio.
    • Annotated each frame with speaker IDs and timestamps for natural pause points.
    • Crafted 209 “interruption” prompts where a model must decide when to interject and what to say, respecting the ongoing dialogue flow.
    • Added “inconsistent” variants where the audio source does not match the visible speaker, forcing models to rely on cross‑modal reasoning.
  2. Benchmark Tasks

    • Speaker Separation & Identification – a classification task: given a short audio‑visual snippet, output the active speaker’s ID.
    • Interruption Timing Control – a regression/decision task: predict the optimal insertion point (in milliseconds) within a live stream.
    • Natural Interruption Generation – a conditional text‑generation task: produce an utterance that is contextually relevant, polite, and temporally aligned.
  3. Evaluation Protocol

    • Perception metrics: accuracy (speaker ID) and timing error (ms).
    • Generation metrics: BLEU/ROUGE for lexical overlap, plus human‑rated social appropriateness and fluency.
    • Robustness checks using the inconsistency set to see if models can detect and correct mismatched cues.
  4. Model Suite

    • Tested 12 publicly available OLMs (e.g., GPT‑4V, LLaVA, Gemini‑Pro Vision) with zero‑shot prompting, as well as a few fine‑tuned variants where possible.
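The task setup and perception metrics above can be sketched in a few lines of Python. The sample schema and function names here are illustrative assumptions, not the paper's actual code; the two metrics (speaker-ID accuracy and absolute timing error in milliseconds) follow the evaluation protocol described in step 3.

```python
from dataclasses import dataclass

@dataclass
class InteractionSample:
    """One interaction-generation instance (hypothetical schema)."""
    clip_id: str
    gold_speaker_id: str      # annotated active speaker
    gold_interrupt_ms: int    # annotated natural pause point
    reference_utterance: str  # reference interruption text

def timing_error_ms(predicted_ms: int, gold_ms: int) -> int:
    """Absolute error for the Interruption Timing Control task."""
    return abs(predicted_ms - gold_ms)

def speaker_id_accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of snippets where the predicted speaker ID matches."""
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

A model that predicts an interruption at 4,350 ms against a gold pause at 4,200 ms would score a 150 ms timing error, which is roughly in the range the best models report below.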

Results & Findings

| Dimension | Best Perception Score | Best Interaction Score |
|---|---|---|
| Speaker ID Accuracy | 94% (Model A) | 68% (Model B) |
| Timing Error (mean) | 120 ms (Model C) | 350 ms (Model D) |
| Interruption Appropriateness (human rating, 5-pt) | 4.2 (Model E) | 2.8 (Model F) |
  • Large variance: Some models excel at identifying speakers but consistently choose awkward interruption moments (e.g., cutting off a speaker mid‑sentence).
  • Perception‑Interaction Decoupling: The correlation between speaker‑ID accuracy and interruption quality is weak (r ≈ 0.32), indicating that mastering perception alone does not translate to socially competent behavior.
  • Robustness Gap: When audio‑visual streams were deliberately misaligned, most models fell back to the dominant modality (usually audio), leading to a 20‑30% drop in both timing and generation scores.
  • Fine‑tuning helps: A small set of interaction‑focused fine‑tuning examples (≈ 500) boosted the best model’s interruption appropriateness from 3.1 to 4.0, suggesting that targeted data can bridge the gap.
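The decoupling finding above rests on a standard Pearson correlation between per-model perception and interaction scores. A minimal self-contained implementation (the score lists here are toy data, not the paper's numbers) would be:

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1 would mean perception scores predict interaction quality almost perfectly; the reported r ≈ 0.32 means they explain only about 10% of the variance (r²), which is what the benchmark calls the perception–interaction divide.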

Practical Implications

  • Virtual Meeting Assistants – Models that can wait for a natural pause and offer concise, context‑aware suggestions (e.g., “Can we clarify the budget figure?”) will be far more usable than those that blurt out generic summaries.
  • Customer‑Support Bots – In multi‑agent calls, the ability to identify the right speaker and interject at the right moment can reduce hand‑off friction and improve satisfaction.
  • Live Streaming & Gaming – Real‑time avatars that can “talk over” or “join” a conversation without breaking immersion require the timing and phrasing capabilities SocialOmni measures.
  • Safety & Compliance – Detecting when a speaker is about to say something sensitive and intervening politely (e.g., “Let’s pause and verify the data”) could be built into compliance‑aware AI agents.
  • Model Development Roadmap – The benchmark gives engineers a concrete, quantifiable target beyond static accuracy, encouraging the integration of temporal reasoning and cross‑modal grounding into OLM training pipelines.

Limitations & Future Work

  • Scale of Interaction Samples – Only 209 generation instances; larger, more diverse scenarios (e.g., multilingual, multi‑cultural norms) are needed for broader generalization.
  • Human Evaluation Scope – Social appropriateness was rated by a relatively small pool of annotators; future work should incorporate crowd‑sourced or expert panels to capture nuanced etiquette differences.
  • Static Prompting – The study used zero‑shot prompts for most models; exploring reinforcement‑learning‑from‑human‑feedback (RLHF) specifically for timing decisions could yield stronger results.
  • Real‑World Deployment Tests – Benchmarks are offline; integrating SocialOmni into live systems (e.g., Zoom plugins) would validate whether the measured gains translate to user‑perceived improvements.

SocialOmni shines a light on the missing piece of conversational AI—when and how to speak, not just what to say. As omni‑modal models become the backbone of next‑gen assistants, this benchmark offers a practical yardstick for building agents that truly listen and respond like humans.

Authors

  • Tianyu Xie
  • Jinfa Huang
  • Yuexiao Ma
  • Rongfang Luo
  • Yan Yang
  • Wang Chen
  • Yuhui Zeng
  • Ruize Fang
  • Yixuan Zou
  • Xiawu Zheng
  • Jiebo Luo
  • Rongrong Ji

Paper Information

  • arXiv ID: 2603.16859v1
  • Categories: cs.AI
  • Published: March 17, 2026
  • PDF: Download PDF
