[Paper] Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos

Published: February 17, 2026 at 12:45 PM EST
4 min read
Source: arXiv - 2602.15757v1

Overview

The paper tackles a pressing problem in content moderation: most automated sexism detectors only decide “sexist vs. not sexist,” which leaves many subtle, context‑dependent instances unnoticed. By introducing a new Spanish‑language multimodal dataset (FineMuSe) and a detailed taxonomy of sexist expressions—including irony and humor—the authors show how large language models (LLMs) can move beyond binary judgments toward fine‑grained, video‑aware detection.

Key Contributions

  • FineMuSe dataset:  A publicly released collection of Spanish social‑media videos annotated with both binary labels and a hierarchical set of fine‑grained sexism categories.
  • Hierarchical taxonomy:  A structured label scheme covering multiple sexism sub‑types (e.g., objectification, stereotyping), non‑sexist content, and rhetorical devices such as irony and humor.
  • Comprehensive evaluation:  Benchmarks of a wide range of multimodal LLMs (e.g., CLIP‑based, Flamingo‑style, GPT‑4V) on both binary and fine‑grained tasks.
  • Human‑level performance insight:  Evidence that state‑of‑the‑art multimodal LLMs can match human annotators on nuanced sexism detection, but still miss cues that rely heavily on visual context.

Methodology

  1. Data collection & annotation

    • Curated thousands of short videos from popular Spanish‑language platforms (TikTok, Instagram Reels, YouTube Shorts).
    • Each video received a binary label (sexist / non‑sexist) and, when sexist, a more specific tag from the taxonomy (e.g., “sexual objectification,” “gendered stereotype”).
    • Annotators also marked whether irony or humor was employed, creating a “rhetorical device” layer.
  2. Taxonomy design

    • Built a three‑level hierarchy:
      • Level 1 – Broad categories (Sexism, Non‑sexism, Irony/Humor).
      • Level 2 – Sub‑types (e.g., “Explicit harassment,” “Implicit bias”).
      • Level 3 – Fine‑grained cues (e.g., “body‑shaming,” “role‑restriction”).
  3. Model suite

    • Fine‑tuned several multimodal LLMs on the training split, feeding them both the video frames (sampled at 1 fps) and the accompanying transcript.
    • Tested zero‑shot and few‑shot prompting strategies for large foundation models that were not explicitly fine‑tuned.
  4. Evaluation

    • Standard metrics: accuracy, macro‑F1 for binary; hierarchical‑precision/recall for fine‑grained labels.
    • Human baseline: a separate group of annotators re‑rated a held‑out test set to gauge inter‑annotator agreement.
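The three-level hierarchy described in step 2 can be represented as a simple nested mapping. This is a minimal sketch, not the paper's actual data format; the category names are the illustrative examples used in this summary, and the helper `label_path` is a hypothetical utility for recovering a cue's full Level 1 → Level 3 path.

```python
# Illustrative sketch of the three-level label hierarchy described above.
# Category names are the examples given in this summary; the paper's
# full taxonomy is richer and may be organized differently.
TAXONOMY = {
    "Sexism": {
        "Explicit harassment": ["body-shaming"],
        "Implicit bias": ["role-restriction", "gendered stereotype"],
    },
    "Non-sexism": {},
    "Irony/Humor": {},
}

def label_path(leaf: str) -> list[str]:
    """Return the Level-1..Level-3 path for a label at any level."""
    for l1, subtypes in TAXONOMY.items():
        if leaf == l1:
            return [l1]
        for l2, cues in subtypes.items():
            if leaf == l2:
                return [l1, l2]
            if leaf in cues:
                return [l1, l2, leaf]
    raise KeyError(f"unknown label: {leaf}")

print(label_path("body-shaming"))  # ['Sexism', 'Explicit harassment', 'body-shaming']
```

Storing labels as paths like this is also what makes hierarchical evaluation straightforward: a prediction that gets the Level-1 and Level-2 ancestors right can earn partial credit even when the Level-3 cue is wrong.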

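Step 3's "frames sampled at 1 fps" amounts to keeping one frame per second of video. A minimal, dependency-free sketch of the index arithmetic (the function name and exact rounding policy are assumptions, not the authors' code):

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0) -> list[int]:
    """Indices of the frames kept when downsampling a clip to target_fps."""
    step = native_fps / target_fps  # native frames between kept samples
    return [round(i * step) for i in range(int(total_frames / step))]

# A 5-second clip shot at 30 fps (150 frames) keeps one frame per second:
print(sample_frame_indices(150, 30.0))  # [0, 30, 60, 90, 120]
```

The selected frames, together with the transcript, then form the multimodal input fed to the fine-tuned models.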
Results & Findings

| Model | Binary Accuracy | Fine‑grained Macro‑F1 | Human Baseline (Macro‑F1) |
| --- | --- | --- | --- |
| Multimodal LLM (fine‑tuned) | 92.4 % | 78.1 % | 80.3 % |
| Zero‑shot GPT‑4V | 88.7 % | 71.4 % | – |
| Text‑only LLM (BERT‑es) | 84.2 % | 62.5 % | – |
  • Competitive performance: The best fine‑tuned multimodal LLM comes within about 2 points of the human macro‑F1 on the fine‑grained task (78.1 % vs. 80.3 %).
  • Visual cues matter: Errors cluster around cases where sexist meaning is conveyed primarily through gestures, facial expressions, or background props—situations where text alone is insufficient.
  • Irony & humor detection: Models struggle most with ironic or humorous sexism, often misclassifying it as non‑sexist, highlighting the need for richer pragmatic reasoning.
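The hierarchical precision/recall used for the fine-grained labels can be sketched with a common instantiation of such metrics: each gold and predicted label set is closed under the ancestor relation, and overlap is measured on the closed sets, so predicting the correct parent category earns partial credit. This is an assumed formulation for illustration; the paper's exact definition may differ.

```python
def augment(labels, parent):
    """Close a label set under the ancestor relation (leaf -> ... -> root)."""
    closed = set()
    for lab in labels:
        while lab is not None:
            closed.add(lab)
            lab = parent.get(lab)
    return closed

def hierarchical_prf(gold, pred, parent):
    """Hierarchical precision, recall, and F1 over ancestor-closed label sets."""
    g, p = augment(gold, parent), augment(pred, parent)
    overlap = len(g & p)
    hp = overlap / len(p) if p else 0.0
    hr = overlap / len(g) if g else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

# Hypothetical parent map mirroring the taxonomy levels in this summary:
parent = {"body-shaming": "Explicit harassment", "Explicit harassment": "Sexism"}
# Predicting only the Level-2 subtype for a Level-3 gold cue still scores
# perfect hierarchical precision and partial recall:
hp, hr, hf = hierarchical_prf(["body-shaming"], ["Explicit harassment"], parent)
```

Under this scheme, the "models miss visually conveyed cues" finding shows up as low recall on the deepest taxonomy levels rather than as outright binary misclassification.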

Practical Implications

  • Content moderation pipelines: Platforms can replace or augment rule‑based binary filters with models that flag specific sexism sub‑types, enabling more nuanced policy enforcement (e.g., distinguishing “harassment” from “stereotyping”).
  • Targeted user feedback: Fine‑grained labels allow automated systems to generate clearer explanations for creators (e.g., “Your video contains gendered stereotyping of professional roles”).
  • Cross‑modal safety tools: The study underscores the importance of integrating visual analysis into moderation stacks, especially for short‑form video apps where text captions are sparse.
  • Dataset as a benchmark: FineMuSe can serve as a testbed for future multimodal bias‑detection research, encouraging the community to move beyond English‑only, text‑only corpora.
  • Policy design: Regulators can reference the taxonomy to define more granular standards for “sexist content,” supporting consistent enforcement across platforms.

Limitations & Future Work

  • Language & cultural scope: The dataset is limited to Spanish and reflects cultural nuances specific to Spanish‑speaking online communities; extending to other languages will be necessary for broader applicability.
  • Visual representation: Current models process frames at a low temporal resolution, which may miss rapid gestures or subtle facial cues; higher‑frame‑rate or video‑transformer architectures could improve detection.
  • Irony & humor: The taxonomy captures these devices, but models still lag behind humans, indicating a need for better pragmatic and commonsense reasoning modules.
  • Scalability: Fine‑tuning large multimodal LLMs is computationally expensive; future work could explore lightweight adapters or distillation techniques for production‑ready deployment.

Bottom line: By marrying a richly annotated multimodal dataset with state‑of‑the‑art LLMs, this research pushes sexism detection from a blunt binary tool toward a nuanced, context‑aware system, opening the door to safer, more responsible online video platforms.

Authors

  • Laura De Grazia
  • Danae Sánchez Villegas
  • Desmond Elliott
  • Mireia Farrús
  • Mariona Taulé

Paper Information

  • arXiv ID: 2602.15757v1
  • Categories: cs.CL, cs.AI
  • Published: February 17, 2026