[Paper] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity

Published: December 11, 2025
3 min read
Source: arXiv - 2512.10882v1

Overview

The paper investigates how well multimodal large language models (mLLMs)—AI systems that can process text, audio, and video together—can detect emotional arousal in political video recordings. By benchmarking these models against human‑annotated datasets, the author shows that while mLLMs can be highly reliable in controlled settings, they stumble on real‑world parliamentary footage, raising concerns for analysts who rely on AI‑driven sentiment tools.

Key Contributions

  • First systematic evaluation of current multimodal LLMs for video‑based emotion detection in political communication.
  • Two complementary datasets: (1) a lab‑controlled set of human‑labeled videos, and (2) authentic parliamentary debate recordings.
  • Demonstrates high reliability of mLLM arousal scores under ideal conditions, with close agreement with human annotations and minimal demographic bias.
  • Reveals a performance drop on real‑world political footage, highlighting risks for downstream statistical analyses.
  • Provides a replicable evaluation framework (code, prompts, and metrics) for future research on multimodal AI in the social sciences.

Methodology

  1. Model Selection – The study tests several publicly available multimodal LLMs (e.g., GPT‑4V, LLaVA, and Gemini Vision) that accept video input and return arousal ratings in text form.
  2. Datasets
    • Controlled Corpus: 500 short video clips (actors expressing a range of arousal levels) manually labeled by multiple annotators.
    • Parliamentary Corpus: 300 minutes of live debate footage from a national parliament, also human‑annotated for arousal.
  3. Prompt Engineering – Uniform prompts ask the model to “rate the speaker’s emotional arousal on a scale of 1‑7,” ensuring comparable outputs across models (a sketch of such a prompt appears after this list).
  4. Evaluation Metrics – Pearson’s r and Krippendorff’s α assess agreement with human labels; demographic bias is probed by correlating errors with speaker gender, age, and ethnicity (the agreement metrics are sketched after this list).
  5. Statistical Checks – The author runs downstream regression analyses (e.g., arousal vs. voting outcomes) to see how model errors propagate into typical political‑science inferences.
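The paper’s exact prompt wording and scoring pipeline are not reproduced here; the snippet below is a minimal sketch of the kind of uniform arousal‑rating prompt described in step 3, assuming an OpenAI‑style chat completions API and a hypothetical list of base64‑encoded frames sampled from one clip.

```python
from openai import OpenAI

client = OpenAI()

AROUSAL_PROMPT = (
    "You will see frames sampled from a video of a political speaker. "
    "Rate the speaker's emotional arousal on a scale of 1 (very calm) "
    "to 7 (very aroused). Answer with a single integer."
)

def rate_clip(frames: list[str], model: str = "gpt-4o") -> int:
    """Ask a multimodal LLM for a 1-7 arousal rating for one clip.

    `frames` is a hypothetical list of base64-encoded JPEG frames.
    """
    content = [{"type": "text", "text": AROUSAL_PROMPT}]
    content += [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame}"}}
        for frame in frames
    ]
    response = client.chat.completions.create(
        model=model,  # assumed model name; swap in whichever mLLM is under test
        messages=[{"role": "user", "content": content}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return int(response.choices[0].message.content.strip())
```

Setting temperature to 0 and demanding a single integer keeps the outputs comparable across models and repeated runs.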
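The agreement checks in step 4 can be sketched as follows, assuming the human and model scores for the same clips are aligned in two arrays; the `krippendorff` package is an assumed tooling choice, not necessarily what the paper’s replication code uses.

```python
import numpy as np
from scipy.stats import pearsonr
import krippendorff

# Illustrative scores, aligned per clip (not data from the paper).
human = np.array([4, 2, 6, 5, 3, 7, 1, 4])
model = np.array([4, 3, 6, 5, 2, 6, 1, 5])

r, p_value = pearsonr(human, model)

# Krippendorff's alpha over a 2 x n_clips reliability matrix (raters x units),
# treating the 1-7 scale as interval-level data.
alpha = krippendorff.alpha(
    reliability_data=np.vstack([human, model]),
    level_of_measurement="interval",
)

print(f"Pearson r = {r:.2f} (p = {p_value:.3f}), Krippendorff alpha = {alpha:.2f}")
```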

Results & Findings

  • Controlled Corpus: mLLMs achieve r ≈ 0.85 with human ratings and α ≈ 0.80, indicating strong reliability. Bias analysis shows no systematic error linked to speaker demographics.
  • Parliamentary Corpus: Performance falls to r ≈ 0.45 and α ≈ 0.40. Errors are larger for speakers with subtle facial expressions or overlapping audio, and a modest bias emerges for gender (slightly lower scores for female speakers).
  • Downstream Impact: When mLLM arousal scores replace human‑based scores in regression models predicting legislative support, the coefficient estimates shift by up to 30%, potentially leading to misleading conclusions (a sketch of this comparison follows).
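The downstream comparison can be illustrated with a small simulation (variable names and data are assumptions for illustration, not the paper’s actual models): fit the same regression once with human arousal scores and once with noisy model scores, then compare the arousal coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
human_arousal = rng.uniform(1, 7, n)
# mLLM scores as a noisy proxy for the human gold standard.
model_arousal = human_arousal + rng.normal(0, 1.2, n)
# Simulated outcome, e.g. a legislative-support measure.
support = 0.5 * human_arousal + rng.normal(0, 1.0, n)

def arousal_coef(arousal: np.ndarray, outcome: np.ndarray) -> float:
    """Return the OLS coefficient on arousal in a bivariate regression."""
    X = sm.add_constant(arousal)
    return sm.OLS(outcome, X).fit().params[1]

b_human = arousal_coef(human_arousal, support)
b_model = arousal_coef(model_arousal, support)
shift = abs(b_model - b_human) / abs(b_human)
print(f"human-based coef = {b_human:.3f}, model-based coef = {b_model:.3f}, shift = {shift:.0%}")
```

Because the model scores act as a noisy proxy, the arousal coefficient typically attenuates; this is exactly the kind of shift the paper warns can distort substantive conclusions.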

Practical Implications

  • Tool Selection: Developers building sentiment‑analysis pipelines for media monitoring should treat current mLLMs as reliable only for controlled or pre‑processed video streams.
  • Pre‑processing Needs: Enhancing audio‑visual quality (e.g., speaker isolation, lighting normalization) can mitigate the drop in real‑world performance.
  • Bias Audits: Even though demographic bias is low in lab settings, regular bias checks are essential before deploying models on live political content (a simple audit is sketched after this list).
  • Research Automation: The provided evaluation framework can be integrated into CI pipelines for political‑science tools, ensuring that model updates don’t silently degrade analytical validity (see the guardrail test sketched below).
  • Policy & Compliance: Organizations using AI to assess political speech must be aware that inaccurate arousal scores could skew public‑opinion dashboards or misinform compliance reporting.
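The bias audit mentioned above can be as simple as testing whether the model’s error differs across speaker groups; the sketch below uses illustrative arrays and a two‑sample t‑test, an assumed method rather than the paper’s exact procedure.

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative arrays aligned per clip (not data from the paper).
human = np.array([5, 3, 6, 2, 4, 7, 1, 5])
model = np.array([5, 2, 6, 2, 3, 6, 1, 5])
speaker_gender = np.array(["f", "m", "f", "m", "f", "m", "f", "m"])

# A systematic difference in error across groups would indicate bias.
error = model - human
t_stat, p_value = ttest_ind(error[speaker_gender == "f"], error[speaker_gender == "m"])
print(f"mean error (female) = {error[speaker_gender == 'f'].mean():.2f}, "
      f"mean error (male) = {error[speaker_gender == 'm'].mean():.2f}, p = {p_value:.3f}")
```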
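For the CI integration, a hypothetical guardrail test (the file name and threshold are illustrative assumptions) can re‑score a fixed, human‑labeled validation set on every model update and fail the build if agreement drops below an acceptance floor:

```python
import json

import numpy as np
from scipy.stats import pearsonr

MIN_PEARSON_R = 0.75  # assumed acceptance floor; tune to your use case

def test_arousal_agreement():
    # `validation_scores.json` is a hypothetical artifact produced by re-scoring
    # a fixed, human-labeled validation set with the updated model.
    with open("validation_scores.json") as f:
        data = json.load(f)
    human = np.array(data["human"])
    model = np.array(data["model"])
    r, _ = pearsonr(human, model)
    assert r >= MIN_PEARSON_R, f"Arousal agreement regressed: r = {r:.2f} < {MIN_PEARSON_R}"
```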

Limitations & Future Work

  • The study evaluates only a handful of publicly released mLLMs; newer or proprietary models may behave differently.
  • Temporal dynamics (e.g., changes in arousal across a speech) are not captured—only static clip ratings are examined.
  • The parliamentary dataset is limited to a single country’s legislature; cross‑cultural validation is needed.
  • Future research should explore multimodal fine‑tuning on domain‑specific video corpora, incorporate continuous arousal trajectories, and develop robust bias mitigation strategies for diverse speaker populations.

Authors

  • Hauke Licht

Paper Information

  • arXiv ID: 2512.10882v1
  • Categories: cs.CL
  • Published: December 11, 2025