[Paper] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Published: December 24, 2025
Source: arXiv - 2512.21337v1

Overview

A new study uncovers a hidden “popularity bias” in today’s leading vision-language models (VLMs). Evaluating these models on a massive collection of building photographs, the authors show that they can be up to 34 % more accurate when predicting the construction year of famous landmarks than of ordinary structures, suggesting the models rely more on memorized facts than on genuine visual reasoning. To make this bias measurable, the researchers built YearGuessr, the largest open benchmark for multi-modal ordinal regression on architectural imagery.

Key Contributions

  • YearGuessr dataset: 55,546 building images from 157 countries, each labeled with a construction year (1001–2024), GPS coordinates, and a Wikipedia page-view count (a proxy for popularity).
  • Popularity‑aware evaluation: Introduced interval‑accuracy metrics that explicitly factor in an item’s popularity, enabling a quantitative bias analysis.
  • Ordinal regression framing: Cast year prediction as an ordinal regression problem, which better respects the ordered nature of time than standard classification (a minimal interval-binning sketch follows this list).
  • Comprehensive benchmark: Evaluated 30+ state‑of‑the‑art VLMs (including CLIP, BLIP, and the authors’ own YearCLIP) on the new dataset.
  • Empirical evidence of memorization: Demonstrated that VLMs achieve up to 34 % higher accuracy on “popular” (high page‑view) buildings, confirming a systematic bias toward memorized content.
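
To make the ordinal framing concrete, here is a minimal sketch. The paper’s actual interval scheme is not spelled out in this summary, so the bin width (25 years) and the function names are illustrative assumptions:

```python
import numpy as np

# Assumed interval scheme: the paper's exact bin width isn't given in this
# summary, so we use hypothetical fixed 25-year bins covering 1001-2024.
BIN_EDGES = np.arange(1000, 2050, 25)  # 1000, 1025, ..., 2025

def year_to_interval(year: int) -> int:
    """Map a construction year to an ordered interval index (0 = oldest)."""
    return int(np.digitize(year, BIN_EDGES)) - 1

def interval_accuracy(pred_years, true_years, tolerance: int = 0) -> float:
    """Fraction of predictions whose interval lies within `tolerance`
    bins of the true interval (tolerance=0 requires the exact interval)."""
    pred = np.array([year_to_interval(y) for y in pred_years])
    true = np.array([year_to_interval(y) for y in true_years])
    return float(np.mean(np.abs(pred - true) <= tolerance))

# Two of three predictions fall in the same 25-year bin as the label.
print(interval_accuracy([1890, 1505, 1700], [1885, 1510, 1950]))  # 0.666...
```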

Methodology

  1. Data collection – Images were scraped from public sources (e.g., Wikipedia, OpenStreetMap) and paired with structured metadata: construction year, latitude/longitude, and Wikipedia page‑view statistics.
  2. Label design – Construction year is treated as a continuous ordinal label; the task is to predict the correct year interval rather than a discrete class.
  3. Model adaptation – Existing VLMs were fine‑tuned on YearGuessr using a pairwise ranking loss that respects ordinal ordering (e.g., “older than” vs. “newer than”); a sketch of one such loss follows this list. The authors also introduced YearCLIP, a CLIP‑style encoder‑decoder that directly outputs a year estimate.
  4. Bias metrics – Two new metrics were defined:
    • Popularity‑Weighted Interval Accuracy (PWIA) – measures accuracy while weighting each sample by its page‑view count.
    • Popularity Gap (PG) – the absolute difference in PWIA between high‑popularity and low‑popularity subsets.
  5. Evaluation protocol – Models were tested on a held‑out split, and results were stratified by popularity quartiles to surface the bias.
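
The summary says fine-tuning used a pairwise ranking loss that respects ordinal ordering, but does not give its exact form. The sketch below is one standard realization, a margin ranking loss over in-batch pairs in PyTorch; the function name and margin value are assumptions, not the authors’ code:

```python
import torch
import torch.nn.functional as F

def pairwise_ordinal_loss(pred_years: torch.Tensor,
                          true_years: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Generic pairwise ranking loss: if building i is truly newer than
    building j, the predicted score for i should exceed the score for j
    by at least `margin`. One plausible form; the paper's may differ."""
    b = pred_years.size(0)
    pi = pred_years.unsqueeze(1).expand(b, b)   # pred[i] at position (i, j)
    pj = pred_years.unsqueeze(0).expand(b, b)   # pred[j] at position (i, j)
    # target[i, j] = +1 if i is truly newer than j, -1 if older, 0 if tied.
    target = torch.sign(true_years.unsqueeze(1) - true_years.unsqueeze(0))
    mask = target != 0                          # ignore same-year pairs
    return F.margin_ranking_loss(pi[mask], pj[mask], target[mask],
                                 margin=margin)

# Toy batch: the third building is truly the newest but is mis-ranked.
pred = torch.tensor([1900.0, 1500.0, 1890.0], requires_grad=True)
true = torch.tensor([1888.0, 1512.0, 2001.0])
loss = pairwise_ordinal_loss(pred, true)
loss.backward()  # gradients push mis-ranked pairs apart
```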

Results & Findings

| Model | Overall Interval Accuracy | High-popularity Accuracy | Low-popularity Accuracy | Popularity Gap |
|---|---|---|---|---|
| CLIP-ViT-B/32 | 62.1 % | 71.4 % | 53.2 % | 18.2 % |
| BLIP-Large | 64.8 % | 73.9 % | 55.7 % | 18.2 % |
| YearCLIP (proposed) | 68.3 % | 77.5 % | 59.1 % | 18.4 % |
| Random baseline | 33.3 % | 33.3 % | 33.3 % | 0 % |
  • All VLMs outperform the random baseline but consistently lag on low‑popularity buildings.
  • The popularity gap (18.2–18.4 % across models) is statistically significant (p < 0.001), confirming that models are not learning a robust visual‑temporal mapping but are instead leaning on memorized, high‑traffic examples (a hedged significance‑test sketch follows this list).
  • YearCLIP improves overall accuracy but does not shrink the popularity gap, indicating that architecture‑specific fine‑tuning alone is insufficient.
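
The significance test behind the p < 0.001 claim is not named in this summary. One standard way to check whether an accuracy gap of this size could arise by chance is a two-proportion z-test; the subset sizes below are hypothetical placeholders, not from the paper:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Two-sided two-proportion z-test for a gap between two accuracies."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled accuracy
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard error
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# YearCLIP's high- vs. low-popularity accuracies from the table above;
# the subset sizes (5,000 each) are assumptions.
z, p_value = two_proportion_z(0.775, 5000, 0.591, 5000)
print(f"z = {z:.1f}, p = {p_value:.2e}")  # z ≈ 19.8, p far below 0.001
```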

Practical Implications

  • Product reliability: Applications that rely on VLMs for historical dating (e.g., heritage preservation tools, real‑estate valuation, AR tourism guides) may produce systematically skewed results for lesser‑known structures.
  • Dataset curation: Engineers should be wary of training pipelines that over‑represent popular entities; balancing datasets by popularity can mitigate memorization effects.
  • Model auditing: The introduced PWIA and PG metrics provide a plug‑and‑play audit for any VLM deployed in a multi‑modal setting, helping teams surface hidden biases before release (a minimal sketch of both metrics follows this list).
  • Fine‑tuning strategies: Incorporating contrastive ordinal losses and popularity‑aware sampling could improve generalization to under‑represented classes.
  • Regulatory compliance: For AI systems used in cultural heritage contexts, demonstrating bias mitigation may become a compliance requirement, especially in jurisdictions emphasizing fairness in AI.
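
To make the audit idea concrete, here is a minimal sketch of both metrics based on their descriptions in this summary. The array names are hypothetical, and the paper’s exact normalization and high/low split (median here) may differ:

```python
import numpy as np

def pwia(interval_correct: np.ndarray, views: np.ndarray) -> float:
    """Popularity-Weighted Interval Accuracy: interval accuracy with each
    sample weighted by its page-view count."""
    return float(np.sum(interval_correct * views) / np.sum(views))

def popularity_gap(interval_correct: np.ndarray, views: np.ndarray) -> float:
    """Popularity Gap: absolute PWIA difference between the high- and
    low-popularity subsets (split at the median page-view count here)."""
    hi = views >= np.median(views)
    return abs(pwia(interval_correct[hi], views[hi])
               - pwia(interval_correct[~hi], views[~hi]))

# Toy audit: simulate a model that is better on popular buildings.
rng = np.random.default_rng(0)
views = rng.lognormal(mean=5.0, sigma=2.0, size=1000)
prob = np.where(views >= np.median(views), 0.75, 0.55)
correct = (rng.random(1000) < prob).astype(float)
print(f"PWIA = {pwia(correct, views):.3f}, PG = {popularity_gap(correct, views):.3f}")
```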

Limitations & Future Work

  • Popularity proxy – Page‑view counts capture online attention but may not fully reflect real‑world fame; alternative signals (tourist footfall, citation counts) could be explored.
  • Geographic coverage – Although spanning 157 countries, the dataset is still skewed toward regions with richer digital documentation (e.g., Europe, North America).
  • Temporal granularity – The model predicts a single year; many historic buildings have phased constructions or renovations that a single label cannot capture.
  • Model scope – The benchmark focuses on VLMs; extending the analysis to pure vision models or multimodal transformers with different pre‑training regimes would broaden insights.
  • Bias mitigation – Future work should test adversarial debiasing, curriculum learning, and synthetic data augmentation to reduce reliance on popularity cues; a minimal popularity‑aware sampling sketch follows this list.
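
As one concrete mitigation direction, the popularity-aware sampling mentioned under Practical Implications can be sketched with PyTorch’s WeightedRandomSampler. The tensors and the inverse-popularity weighting scheme below are illustrative assumptions, not the paper’s recipe:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical training data: image tensors, year labels, page-view counts.
images = torch.randn(5, 3, 224, 224)
years = torch.tensor([1931.0, 1877.0, 1960.0, 1820.0, 1889.0])
views = torch.tensor([120_000.0, 300.0, 4_500.0, 50.0, 2_000_000.0])

# Inverse-popularity weights: low page-view buildings are drawn more often,
# counteracting the over-representation of famous landmarks.
weights = 1.0 / (views + 1.0)

sampler = WeightedRandomSampler(weights, num_samples=len(views),
                                replacement=True)
loader = DataLoader(TensorDataset(images, years), batch_size=2,
                    sampler=sampler)

for batch_images, batch_years in loader:
    pass  # fine-tune the VLM on these popularity-balanced batches
```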

If you’re building AI products that interpret visual content, the YearGuessr benchmark and the authors’ bias metrics are worth a look. They provide a concrete way to test whether your model truly “understands” an image—or is just reciting the most Googled facts.

Authors

  • Li‑Zhong Szu‑Tu
  • Ting‑Lin Wu
  • Chia‑Jui Chang
  • He Syu
  • Yu‑Lun Liu

Paper Information

  • arXiv ID: 2512.21337v1
  • Categories: cs.CV
  • Published: December 24, 2025