[Paper] Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Source: arXiv - 2602.15757v1
Overview
The paper tackles a pressing problem in content moderation: most automated sexism detectors only decide “sexist vs. not sexist,” which leaves many subtle, context‑dependent instances unnoticed. By introducing a new Spanish‑language multimodal dataset (FineMuSe) and a detailed taxonomy of sexist expressions—including irony and humor—the authors show how large language models (LLMs) can move beyond binary judgments toward fine‑grained, video‑aware detection.
Key Contributions
- FineMuSe dataset: A publicly released collection of Spanish social‑media videos annotated with both binary labels and a hierarchical set of fine‑grained sexism categories.
- Hierarchical taxonomy: A structured label scheme covering multiple sexism sub‑types (e.g., objectification, stereotyping), non‑sexist content, and rhetorical devices such as irony and humor.
- Comprehensive evaluation: Benchmarks of a wide range of multimodal LLMs (e.g., CLIP‑based, Flamingo‑style, GPT‑4V) on both binary and fine‑grained tasks.
- Human‑level performance insight: Evidence that state‑of‑the‑art multimodal LLMs can match human annotators on nuanced sexism detection, but still miss cues that rely heavily on visual context.
Methodology
Data collection & annotation
- Curated thousands of short videos from popular Spanish‑language platforms (TikTok, Instagram Reels, YouTube Shorts).
- Each video received a binary label (sexist / non‑sexist) and, when sexist, a more specific tag from the taxonomy (e.g., “sexual objectification,” “gendered stereotype”).
- Annotators also marked whether irony or humor was employed, creating a “rhetorical device” layer.
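The layered annotation described above can be pictured as a simple record. A minimal sketch follows; the field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VideoAnnotation:
    # Illustrative record combining the three annotation layers
    # described in the summary; field names are assumptions.
    video_id: str
    is_sexist: bool                      # binary label
    fine_label: Optional[str] = None     # taxonomy tag, only when sexist
    rhetorical_devices: list = field(default_factory=list)  # e.g. ["irony"]

# Example: a sexist video using irony as a rhetorical device.
ann = VideoAnnotation("vid_001", True, "sexual objectification", ["irony"])
```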
Taxonomy design
- Built a three‑level hierarchy:
- Level 1 – Broad categories (Sexism, Non‑sexism, Irony/Humor).
- Level 2 – Sub‑types (e.g., “Explicit harassment,” “Implicit bias”).
- Level 3 – Fine‑grained cues (e.g., “body‑shaming,” “role‑restriction”).
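A three‑level scheme like this is naturally encoded as a nested mapping. The sketch below is a hypothetical encoding; any category name not quoted in the summary is a placeholder, not part of the paper's taxonomy.

```python
# Hypothetical encoding of the three-level hierarchy.
# Level 1: broad categories; Level 2: sub-types; Level 3: fine-grained cues.
TAXONOMY = {
    "Sexism": {
        "Explicit harassment": ["body-shaming"],
        "Implicit bias": ["role-restriction"],
    },
    "Non-sexism": {},
    "Irony/Humor": {},
}

def fine_grained_labels(tree):
    """Collect all Level-3 cues from the hierarchy."""
    return [cue for subs in tree.values()
            for cues in subs.values()
            for cue in cues]
```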
Model suite
- Fine‑tuned several multimodal LLMs on the training split, feeding them both the video frames (sampled at 1 fps) and the accompanying transcript.
- Tested zero‑shot and few‑shot prompting strategies for large foundation models that were not explicitly fine‑tuned.
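The 1 fps frame sampling can be sketched as an index computation. This is a minimal illustration of the sampling rate mentioned above, assuming a uniform stride; the paper's actual decoding pipeline is not specified in the summary.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float = 1.0):
    """Return frame indices approximating target_fps sampling.

    Sketch of the 1-fps sampling described in the summary: pick every
    k-th frame, where k is the native-to-target fps ratio.
    """
    step = max(1, round(native_fps / target_fps))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps (300 frames) yields 10 sampled frames.
indices = sample_frame_indices(total_frames=300, native_fps=30.0)
```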
Evaluation
- Standard metrics: accuracy and macro‑F1 for the binary task; hierarchical precision/recall for the fine‑grained labels.
- Human baseline: a separate group of annotators re‑rated a held‑out test set to gauge inter‑annotator agreement.
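Hierarchical precision/recall gives partial credit when a prediction lands in the right branch of the taxonomy, by comparing ancestor sets rather than exact labels. The sketch below uses the standard ancestor-augmentation formulation with a toy parent map; it is an assumption about the metric's form, not the paper's exact implementation.

```python
# Toy parent map for illustration; not the paper's taxonomy.
PARENT = {
    "body-shaming": "Explicit harassment",
    "role-restriction": "Implicit bias",
    "Explicit harassment": "Sexism",
    "Implicit bias": "Sexism",
    "Sexism": None,
}

def ancestors(label):
    """Return the label plus all of its ancestors in the hierarchy."""
    out = set()
    while label is not None:
        out.add(label)
        label = PARENT[label]
    return out

def hierarchical_pr(pred, gold):
    """Hierarchical precision and recall via ancestor-set overlap."""
    p, g = ancestors(pred), ancestors(gold)
    overlap = len(p & g)
    return overlap / len(p), overlap / len(g)

# Predicting a sibling cue still earns partial credit through the
# shared "Sexism" ancestor.
prec, rec = hierarchical_pr("body-shaming", "role-restriction")
```

An exact match scores 1.0 on both measures, while a prediction in the wrong sub-type is penalized less than one in a wholly different branch.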
Results & Findings
| Model | Binary Accuracy | Fine‑grained Macro‑F1 | Human Baseline (Macro‑F1) |
|---|---|---|---|
| Multimodal LLM (fine‑tuned) | 92.4 % | 78.1 % | 80.3 % |
| Zero‑shot GPT‑4V | 88.7 % | 71.4 % | — |
| Text‑only LLM (BERT‑es) | 84.2 % | 62.5 % | — |
- Competitive performance: The best fine‑tuned multimodal LLM comes within roughly 2 macro‑F1 points of the human baseline on the fine‑grained task (78.1 vs. 80.3).
- Visual cues matter: Errors cluster around cases where sexist meaning is conveyed primarily through gestures, facial expressions, or background props—situations where text alone is insufficient.
- Irony & humor detection: Models struggle most with ironic or humorous sexism, often misclassifying it as non‑sexist, highlighting the need for richer pragmatic reasoning.
Practical Implications
- Content moderation pipelines: Platforms can replace or augment rule‑based binary filters with models that flag specific sexism sub‑types, enabling more nuanced policy enforcement (e.g., distinguishing “harassment” from “stereotyping”).
- Targeted user feedback: Fine‑grained labels allow automated systems to generate clearer explanations for creators (e.g., “Your video contains gendered stereotyping of professional roles”).
- Cross‑modal safety tools: The study underscores the importance of integrating visual analysis into moderation stacks, especially for short‑form video apps where text captions are sparse.
- Dataset as a benchmark: FineMuSe can serve as a testbed for future multimodal bias‑detection research, encouraging the community to move beyond English‑only, text‑only corpora.
- Policy design: Regulators can reference the taxonomy to define more granular standards for “sexist content,” supporting consistent enforcement across platforms.
Limitations & Future Work
- Language & cultural scope: The dataset is limited to Spanish and reflects cultural nuances specific to Spanish‑speaking online communities; extending to other languages will be necessary for broader applicability.
- Visual representation: Current models process frames at a low temporal resolution, which may miss rapid gestures or subtle facial cues; higher‑frame‑rate or video‑transformer architectures could improve detection.
- Irony & humor: The taxonomy captures these devices, but models still lag behind humans, indicating a need for better pragmatic and commonsense reasoning modules.
- Scalability: Fine‑tuning large multimodal LLMs is computationally expensive; future work could explore lightweight adapters or distillation techniques for production‑ready deployment.
Bottom line: By marrying a richly annotated multimodal dataset with state‑of‑the‑art LLMs, this research pushes sexism detection from a blunt binary tool toward a nuanced, context‑aware system—opening the door for safer, more responsible online video platforms.
Authors
- Laura De Grazia
- Danae Sánchez Villegas
- Desmond Elliott
- Mireia Farrús
- Mariona Taulé
Paper Information
- arXiv ID: 2602.15757v1
- Categories: cs.CL, cs.AI
- Published: February 17, 2026