[Paper] Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science

Published: December 19, 2025
3 min read
Source: arXiv (2512.17752v1)

Overview

The ABCDE dataset bundles more than 400 million text snippets—from tweets and blogs to books and AI‑generated prose—each enriched with a comprehensive set of affect‑related annotations. By unifying features that span Affect, Body, Cognition, Demographics, and Emotion, the resource aims to lower the barrier for researchers and developers who want to probe human feelings, health, and social behavior through language.

Key Contributions

  • Massive, multi‑source corpus (400 M+ utterances) covering social media, long‑form writing, and synthetic text.
  • Unified annotation schema (ABCDE) that captures five complementary dimensions of affective information.
  • Open‑access tooling for easy discovery, download, and integration of the dataset into existing pipelines.
  • Cross‑disciplinary relevance, demonstrated through case studies in mental‑health monitoring, political sentiment, and user‑modeling.
  • Benchmark baselines for common affective tasks (emotion detection, age/gender inference, bodily‑state prediction) using state‑of‑the‑art language models.

Methodology

  1. Data Harvesting – The authors scraped publicly available text from four major streams:
    • Twitter (≈ 150 M tweets)
    • Reddit & blog platforms (≈ 120 M posts)
    • Digitized books (≈ 80 M sentences)
    • Large language model (LLM) generators (≈ 50 M synthetic utterances)
  2. Pre‑processing – Duplicate removal, profanity filtering, and language detection left only English‑language content with minimal noise.
  3. Feature Extraction – Six existing lexical resources (e.g., NRC Emotion Lexicon, LIWC, VAD lexicons) and two custom classifiers (a body‑state tagger and a demographic predictor) were run over every utterance. Each token received binary or continuous scores for:
    • Affect (valence, arousal, dominance)
    • Body (mentions of physiological states, pain, fatigue)
    • Cognition (certainty, insight, causation)
    • Demographics (age, gender, education cues)
    • Emotion (basic emotions, complex blends)
  4. Quality Assurance – Random samples were manually verified (≈ 5 k items) to estimate annotation precision (> 85 % for most dimensions).
  5. Packaging – The final corpus is released as compressed JSONL files with accompanying index files and a Python SDK that abstracts away the loading and filtering steps.
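Because the corpus ships as compressed JSONL, individual shards can be streamed with the Python standard library alone, without the SDK. The sketch below is illustrative only: the field names (`text`, `affect`, `valence`) are assumptions about the record schema, not the dataset's documented layout.

```python
import gzip
import json


def iter_records(path):
    """Stream records one at a time from a gzip-compressed JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


def filter_by_valence(records, min_valence=0.7):
    """Keep utterances whose (hypothetical) affect block reports
    valence at or above the threshold."""
    for rec in records:
        if rec.get("affect", {}).get("valence", 0.0) >= min_valence:
            yield rec
```

Streaming rather than loading shards wholesale matters at this scale: a 400 M-utterance corpus will not fit in memory, but a generator pipeline like this filters it in constant space.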

Results & Findings

  • Coverage: Over 92 % of utterances received at least one non‑null label across the five dimensions, confirming the feasibility of large‑scale affective annotation.
  • Correlation Patterns: Expected relationships emerged (e.g., high arousal ↔ anger, sadness ↔ low valence), as did novel cross‑dimensional links (e.g., body‑related fatigue mentions strongly co‑occur with low‑energy cognitive states).
  • Baseline Performance: Fine‑tuned BERT models trained on ABCDE achieved state‑of‑the‑art F1 scores on standard emotion‑classification benchmarks (≈ 0.78) while also learning to predict demographic cues with > 0.80 accuracy.
  • Synthetic vs. Human Text: AI‑generated utterances displayed a narrower affective range, suggesting that current LLMs may under‑represent certain emotional or bodily states.
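A "narrower affective range" can be operationalized as lower dispersion of valence scores across a source's utterances. A minimal sketch of that comparison, using made-up scores rather than figures from the paper:

```python
from statistics import pstdev


def affective_range(valence_scores):
    """Population standard deviation of valence, used here as a
    simple proxy for how wide a source's affective range is."""
    return pstdev(valence_scores)


# Illustrative scores only — not values reported in the paper.
human_valence = [0.10, 0.90, 0.20, 0.80, 0.50, 0.95, 0.05]
synthetic_valence = [0.40, 0.50, 0.55, 0.45, 0.50, 0.60, 0.50]
```

Under this proxy, the human sample's wider spread yields a larger `affective_range`, matching the paper's qualitative finding that LLM text clusters around a narrower emotional band.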

Practical Implications

  • Rapid Prototyping – Developers can plug the ABCDE SDK into sentiment‑analysis or user‑profiling services without building custom lexicons from scratch.
  • Mental‑Health Apps – Real‑time detection of body‑state language (e.g., “headache,” “exhausted”) combined with affect scores enables early warning systems for stress or depression.
  • Personalized Content – Marketing platforms can tailor messaging based on inferred demographics and emotional tone, improving engagement while respecting privacy (all data is anonymized).
  • Policy & Social Research – Analysts can track population‑level shifts in affective language across events (elections, pandemics) using a single, consistent feature set.
  • LLM Evaluation – The dataset offers a benchmark for measuring how well generative models capture nuanced affective cues, guiding next‑generation model training.
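The mental-health use case above combines two of the ABCDE dimensions: a body-state mention plus a low affect score. A toy version of that rule is sketched below; the term list and threshold are illustrative choices, not resources or values from the paper.

```python
# Hypothetical mini-lexicon of bodily-state terms for illustration.
BODY_STATE_TERMS = {"headache", "exhausted", "fatigue", "pain", "tired"}


def flag_for_review(text, valence, threshold=0.35):
    """Flag an utterance when it both mentions a bodily state AND
    carries low valence — a toy version of the early-warning idea."""
    tokens = {tok.strip(".,!?;:").lower() for tok in text.split()}
    return bool(tokens & BODY_STATE_TERMS) and valence < threshold
```

Requiring both signals is the point: "I'm exhausted but so happy we finished!" mentions a bodily state yet carries high valence, so it would not be flagged.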

Limitations & Future Work

  • Bias & Representation – The source mix leans heavily toward English‑speaking, internet‑active populations; under‑represented groups may be mischaracterized.
  • Annotation Noise – Automatic lexicon‑based labeling inevitably introduces errors, especially for sarcasm, idioms, or emerging slang.
  • Static Snapshot – The corpus reflects a specific time window (2020‑2023); affective language evolves, so periodic updates are needed.
  • Future Directions – The authors plan to (i) expand to multilingual corpora, (ii) incorporate multimodal signals (audio/video), (iii) refine demographic predictors with privacy‑preserving techniques, and (iv) develop active‑learning pipelines to improve annotation quality over time.

Authors

  • Jan Philip Wahle
  • Krishnapriya Vishnubhotla
  • Bela Gipp
  • Saif M. Mohammad

Paper Information

  • arXiv ID: 2512.17752v1
  • Categories: cs.CL
  • Published: December 19, 2025