[Paper] Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech

Published: December 15, 2025
Source: arXiv - 2512.13685v1

Overview

This paper tackles a core challenge in using language‑based AI for Alzheimer’s Disease (AD) screening: distinguishing genuine semantic deficits from superficial text patterns that models might latch onto. By systematically “scrambling” the surface form of spontaneous speech while keeping its meaning intact, the authors show that modern language models can still flag AD, suggesting that the semantic signal is robust enough for early‑stage detection.

Key Contributions

  • Semantic‑only evaluation pipeline: Introduces a novel transformation that rewrites sentences (changing syntax and vocabulary) but preserves meaning, allowing isolation of semantic information from surface cues.
  • Quantitative impact analysis: Demonstrates that classification performance drops only marginally (a macro‑F1 decrease of about 0.03) when surface cues are removed, indicating that the models rely on deeper semantic features rather than surface patterns.
  • Image reconstruction experiment: Tests whether picture‑description transcripts contain enough detail to regenerate the original image with generative models; finds that this reconstruction step introduces substantial noise and degrades AD detection.
  • Interpretability framework: Provides a practical method for detecting and eliminating spurious correlations in clinical NLP pipelines, improving trustworthiness of AI‑based screening tools.
  • Open‑source resources: Releases the transformation scripts and evaluation code, enabling reproducibility and further research on semantic robustness.

Methodology

  1. Data collection – Transcripts come from a standard picture‑description task (e.g., the “Cookie Theft” image) administered to participants with and without AD.
  2. Surface‑form transformation – Each transcript is automatically paraphrased using a combination of syntactic reordering, synonym substitution, and controlled language‑model generation (a quality‑filter sketch follows this list). The process is tuned to achieve:
    • Low BLEU/chrF scores against the source transcript (indicating strong surface changes)
    • High semantic similarity (measured by sentence‑embedding cosine similarity).
  3. Classification models – Pre‑trained transformer‑based classifiers (e.g., BERT, RoBERTa) are fine‑tuned on the original data, then evaluated on three test sets:
    • Original transcripts
    • Transformed (semantic‑preserving) transcripts
    • Image‑reconstructed transcripts (descriptions derived via images regenerated from the original transcripts).
  4. Metrics – Macro‑averaged F1 is the primary metric, complemented by confusion matrices and feature‑importance visualizations to assess where errors shift.
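As a concrete illustration of the tuning criteria in step 2 above, the sketch below filters candidate paraphrases so that surface overlap with the source transcript is low (BLEU/chrF) while sentence‑embedding similarity stays high. The thresholds, the embedding model, and the helper name are illustrative assumptions, not values or tooling from the paper.

```python
# A minimal sketch (not the authors' code) of the paraphrase-quality
# filter described in step 2: reject candidates unless surface overlap
# is low and sentence-embedding similarity is high.
import sacrebleu
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def keep_paraphrase(original: str, paraphrase: str,
                    max_bleu: float = 30.0,    # hypothetical threshold (0-100)
                    max_chrf: float = 50.0,    # hypothetical threshold (0-100)
                    min_cosine: float = 0.85   # hypothetical threshold (0-1)
                    ) -> bool:
    """Accept a paraphrase only if it changes the surface form but keeps the meaning."""
    bleu = sacrebleu.sentence_bleu(paraphrase, [original]).score
    chrf = sacrebleu.sentence_chrf(paraphrase, [original]).score
    emb = embedder.encode([original, paraphrase], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()
    return bleu <= max_bleu and chrf <= max_chrf and cosine >= min_cosine

print(keep_paraphrase(
    "The boy is stealing cookies from the jar.",
    "A child sneaks biscuits out of the container."))
```

Tightening min_cosine trades transformation strength for semantic fidelity, so in practice the thresholds would be calibrated jointly.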

Results & Findings

| Test set | Macro‑F1 | Δ vs. original |
| --- | --- | --- |
| Original | 0.78 | baseline |
| Transformed (semantic‑preserving) | 0.75 | −0.03 |
| Image‑reconstructed | 0.62 | −0.16 |
  • Semantic robustness: The modest 0.03 drop shows that the model’s predictive power stems largely from meaning rather than word choice or syntax (a minimal macro‑F1 comparison is sketched after this list).
  • Noise sensitivity: When transcripts are routed through image reconstruction, performance degrades substantially, indicating that the regeneration step discards or distorts much of the diagnostic signal.
  • Interpretability gain: Feature‑importance analysis reveals that semantic features (e.g., topic coherence, concept density) dominate the decision process once surface cues are stripped away.
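To make the Δ values in the table concrete, here is a minimal macro‑F1 comparison using scikit‑learn; the labels and predictions are placeholders, not the paper’s data.

```python
# Minimal sketch: compare macro-averaged F1 for one classifier across the
# three test sets. Labels and predictions are placeholders, not paper data.
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred):
    # Macro averaging weights the AD and control classes equally,
    # which matters when group sizes are imbalanced.
    return f1_score(y_true, y_pred, average="macro")

y_true = [1, 1, 0, 0, 1, 0]           # 1 = AD, 0 = control (hypothetical)
preds = {                             # hypothetical model outputs per test set
    "original": [1, 1, 0, 0, 1, 0],
    "transformed": [1, 1, 0, 0, 0, 0],
    "image_reconstructed": [1, 0, 0, 1, 0, 0],
}

baseline = macro_f1(y_true, preds["original"])
for name, y_pred in preds.items():
    score = macro_f1(y_true, y_pred)
    print(f"{name}: macro-F1 = {score:.2f} (delta {score - baseline:+.2f})")
```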

Practical Implications

  • More trustworthy screening tools – Clinicians can deploy language‑model‑based AD detectors with higher confidence that the model is reacting to genuine cognitive decline rather than idiosyncratic phrasing.
  • Data‑efficiency – Since semantic information alone suffices, smaller, privacy‑preserving representations (e.g., anonymized sentence embeddings) could be shared across institutions without exposing raw speech (see the sketch after this list).
  • Robustness to dialects & accents – By focusing on meaning, systems become less vulnerable to regional vocabulary or speech‑to‑text errors, widening applicability in multilingual settings.
  • Early‑stage detection – Semantic impairments often appear before overt lexical errors; this pipeline could flag subtle deficits that traditional neuro‑psychological tests miss.
  • Regulatory readiness – Demonstrating that models are not over‑fitting to surface artifacts aligns with emerging AI‑in‑health guidelines demanding explainability and bias mitigation.
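As a sketch of the embedding‑sharing idea in the data‑efficiency point: each site exports only fixed‑size sentence embeddings of its transcripts, and a lightweight classifier trains on those vectors. The model name, classifier, and example transcripts below are assumptions for illustration, not the paper’s setup.

```python
# Minimal sketch of the embedding-sharing idea: only fixed-size sentence
# embeddings leave an institution, never the raw transcripts or audio.
# Model choice, classifier, and data are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

transcripts = [  # hypothetical picture-description excerpts
    "There is a boy on a stool reaching for the cookie jar.",
    "The, uh, the thing there... the boy, he wants something up there.",
]
labels = [0, 1]  # 0 = control, 1 = AD (hypothetical)

X = embedder.encode(transcripts)       # shareable vectors, no raw speech
clf = LogisticRegression().fit(X, labels)

new_vec = embedder.encode(["The mother is drying dishes at the sink."])
print(clf.predict(new_vec))
```

In practice, participating institutions would also need to agree on the embedding model and version, since vectors from different encoders are not comparable.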

Limitations & Future Work

  • Transformation quality – Automatic paraphrasing may occasionally alter nuance, potentially under‑estimating the role of subtle linguistic cues.
  • Dataset scope – Experiments are limited to a single picture‑description task; broader conversational or narrative datasets need validation.
  • Model diversity – The study focuses on transformer classifiers; exploring other architectures (e.g., graph‑based semantic parsers) could yield richer insights.
  • Longitudinal assessment – Future work should test whether the semantic‑only pipeline can track disease progression over time, not just binary classification.

Bottom line: By stripping away surface noise and zeroing in on meaning, this research shows that AI can reliably detect Alzheimer’s‑related language changes, paving the way for more interpretable, robust, and clinically useful speech‑based diagnostics.

Authors

  • Dylan Phelps
  • Rodrigo Wilkens
  • Edward Gow‑Smith
  • Lilian Hubner
  • Bárbara Malcorra
  • César Rennó‑Costa
  • Marco Idiart
  • Maria‑Cruz Villa‑Uriol
  • Aline Villavicencio

Paper Information

  • arXiv ID: 2512.13685v1
  • Categories: cs.CL
  • Published: December 15, 2025
  • PDF: https://arxiv.org/pdf/2512.13685v1