[Paper] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Published: November 25, 2025
4 min read
Source: arXiv - 2511.21006v1

Overview

The paper “TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models” investigates why today’s open‑source LLMs excel at definition‑style queries but stumble when asked for examples, paraphrases, or deeper explanations—especially for rare or technical concepts. By building a diagnostic pipeline (TrackList) and a new medical‑term dataset (RefoMed‑EN), the authors expose how pre‑training data frequency shapes a model’s ability to handle diverse linguistic requests.

Key Contributions

  • TrackList pipeline – a fine‑grained, reproducible framework that combines linguistic annotation, statistical analysis, and embedding‑based similarity metrics to evaluate LLM responses across multiple query types.
  • RefoMed‑EN dataset – 6,170 human‑annotated medical terms paired with definitions, denominations, exemplifications, explanations, and paraphrases, providing a benchmark for “head vs. tail” knowledge.
  • Empirical study of head/tail effects – systematic comparison of model performance on high‑frequency (head) versus low‑frequency (tail) concepts across five answer styles.
  • Insight into paraphrasing bias – evidence that LLMs tend to paraphrase popular concepts more aggressively while preserving original wording for rare, technical items.
  • Open‑source release – code, data, and analysis scripts are publicly available, enabling the community to extend the evaluation to other domains or models.

Methodology

  1. Query Generation – For each term in RefoMed‑EN the authors crafted five prompt templates targeting different linguistic outputs (a template sketch follows this list):

    • Definition (what is X?)
    • Denomination (what is another name for X?)
    • Exemplification (give an example of X)
    • Explanation (why does X happen?)
    • Paraphrase (re‑state X in other words)
  2. Model Inference – Several open LLMs (e.g., LLaMA‑2, Falcon, Mistral) were queried with the same prompts, keeping temperature and max‑tokens constant to isolate linguistic capability (see the decoding sketch after this list).

  3. TrackList Analysis – The pipeline evaluates each generated answer on three fronts (scoring sketch below):

    • Syntactic similarity (BLEU, ROUGE) against the human reference.
    • Semantic similarity (Sentence‑BERT cosine similarity, BERTScore).
    • Statistical correlation between term frequency in the pre‑training corpus (estimated via public token‑frequency tables) and the similarity scores.
  4. Head/Tail Split – Terms were bucketed into “head” (top 10% by frequency) and “tail” (bottom 10%) groups, allowing a direct comparison of performance across knowledge rarity.

  5. Statistical Testing – Paired t‑tests and Spearman’s ρ assess the significance of observed gaps between query types and frequency buckets (bucketing and testing sketch below).
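
As a concrete companion to step 1, here is a minimal Python sketch of the five query templates. The exact wording used in the paper is not reproduced here, so the template strings below are illustrative assumptions:

```python
# Hypothetical prompt templates for the five query types; the paper's
# exact phrasing may differ.
PROMPT_TEMPLATES = {
    "definition":      "What is {term}?",
    "denomination":    "What is another name for {term}?",
    "exemplification": "Give an example of {term}.",
    "explanation":     "Why does {term} happen?",
    "paraphrase":      "Restate '{term}' in other words.",
}

def build_queries(term: str) -> dict:
    """Instantiate all five query types for a single RefoMed-EN term."""
    return {qtype: tpl.format(term=term) for qtype, tpl in PROMPT_TEMPLATES.items()}

# Example: build_queries("myocardial infarction")
```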
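
For step 2, a sketch of fixed-setting inference using the Hugging Face transformers pipeline; the checkpoint name and decoding values are illustrative choices, not the paper's configuration:

```python
# pip install transformers torch
from transformers import pipeline

# Illustrative checkpoint; any of the evaluated open models could be used.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def answer(prompt: str) -> str:
    # Constant decoding settings so score differences reflect the query
    # type and term rarity, not sampling variance.
    out = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
    return out[0]["generated_text"]
```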
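
Step 3's three-pronged scoring might look like the following, assuming the sentence-transformers, bert-score, and rouge-score packages; the embedding model is an illustrative default, not necessarily the one used in the paper:

```python
# pip install sentence-transformers bert-score rouge-score
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bertscore
from rouge_score import rouge_scorer

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def score_answer(reference: str, candidate: str) -> dict:
    """Score one generated answer against its human-written reference."""
    # Semantic similarity via sentence-embedding cosine.
    emb = sbert.encode([reference, candidate], convert_to_tensor=True)
    sbert_cos = util.cos_sim(emb[0], emb[1]).item()
    # Semantic similarity via BERTScore F1.
    _, _, f1 = bertscore([candidate], [reference], lang="en", verbose=False)
    # Surface (n-gram) overlap via ROUGE-L F-measure.
    rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure
    return {"sbert_cosine": sbert_cos, "bertscore_f1": f1.item(), "rougeL": rouge_l}
```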
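
Finally, steps 4 and 5 reduce to percentile bucketing plus standard tests from scipy; the arrays below are synthetic stand-ins for the real per-term frequencies and similarity scores:

```python
# pip install numpy scipy
import numpy as np
from scipy.stats import ttest_rel, spearmanr

rng = np.random.default_rng(0)
# Synthetic stand-ins; in the real pipeline, freqs come from public
# token-frequency tables and the score arrays from TrackList scoring.
freqs = rng.lognormal(mean=5.0, sigma=2.0, size=6170)
scores_definition = np.clip(0.7 + 0.02 * np.log1p(freqs) + rng.normal(0, 0.05, 6170), 0, 1)
scores_exemplification = np.clip(scores_definition - 0.25 + rng.normal(0, 0.05, 6170), 0, 1)

# Step 4: head = top 10% by frequency, tail = bottom 10%.
head = freqs >= np.quantile(freqs, 0.90)
tail = freqs <= np.quantile(freqs, 0.10)
print(f"definition, head vs tail: {scores_definition[head].mean():.3f} "
      f"vs {scores_definition[tail].mean():.3f}")

# Step 5: paired t-test between query types on the same terms, and
# Spearman's rho between term frequency and similarity score.
t_stat, p_t = ttest_rel(scores_definition, scores_exemplification)
rho, p_rho = spearmanr(freqs, scores_definition)
print(f"paired t-test: t={t_stat:.2f}, p={p_t:.3g}")
print(f"Spearman rho: {rho:.2f}, p={p_rho:.3g}")
```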

Results & Findings

| Query Type | Avg. Semantic Similarity (Head) | Avg. Semantic Similarity (Tail) | Relative Drop vs. Definition |
| --- | --- | --- | --- |
| Definition | 0.84 | 0.78 | (baseline) |
| Denomination | 0.71 | 0.66 | –15% |
| Explanation | 0.68 | 0.60 | –19% |
| Exemplification | 0.52 | 0.44 | –38% |
| Paraphrase | 0.77 | 0.71 | –9% |

  • Definition queries consistently achieved the highest similarity scores, confirming that LLMs are most reliable for factual recall.
  • Exemplification suffered the steepest performance drop, especially on tail concepts, indicating a weakness in generating concrete examples for rare knowledge.
  • Paraphrasing bias: For head concepts, models frequently rewrote definitions (higher lexical divergence) while preserving wording for tail items, suggesting a “copy‑when‑uncertain” strategy.
  • Statistical correlation: Term frequency correlated positively with all similarity metrics (Spearman ρ ≈ 0.42, p < 0.001), reinforcing the head‑vs‑tail effect.

Practical Implications

  • Product developers building chatbots or knowledge‑base assistants should treat LLM‑generated examples with caution, especially for niche domains (e.g., rare medical conditions, specialized engineering terms).
  • Prompt engineering: Adding explicit “give an example” scaffolding or providing few‑shot exemplars can mitigate the exemplification gap (see the sketch after this list).
  • Data curation: Augmenting pre‑training corpora with balanced coverage of tail concepts (synthetic data, domain‑specific corpora) is likely to improve downstream performance on non‑definition queries.
  • Evaluation pipelines: TrackList can be integrated into CI/CD for LLM‑powered services, automatically flagging regressions in answer diversity before release.
  • Compliance & safety: Since models tend to paraphrase popular knowledge more aggressively, they may inadvertently introduce hallucinations for well‑known facts; monitoring paraphrase fidelity becomes a compliance requirement in regulated fields (e.g., healthcare).
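
A minimal sketch of the few-shot scaffolding idea from the prompt-engineering point above; the demonstration pairs and the tail term are invented for illustration:

```python
# Hypothetical few-shot scaffold to elicit concrete examples for tail terms.
FEW_SHOT_EXEMPLIFICATION = """\
Give a concrete example of each medical term.

Term: antibiotic
Example: Amoxicillin is an antibiotic commonly prescribed for ear infections.

Term: vaccine
Example: The measles vaccine protects children against the measles virus.

Term: {term}
Example:"""

prompt = FEW_SHOT_EXEMPLIFICATION.format(term="achalasia")
# Send `prompt` to the model with the same fixed decoding settings as above.
```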

Limitations & Future Work

  • Domain focus: The study centers on medical terminology; results may differ for other technical domains or for general‑purpose vocabularies.
  • Model scope: Only a handful of open LLMs were evaluated; proprietary models (e.g., GPT‑4) could exhibit different head/tail dynamics.
  • Frequency estimation: Token‑frequency proxies derived from public corpora may not perfectly reflect the actual distribution in each model’s private training set.
  • Future directions suggested by the authors include extending TrackList to multilingual settings, exploring retrieval‑augmented generation as a remedy for tail‑knowledge gaps, and investigating curriculum‑learning strategies that explicitly balance head and tail exposure during fine‑tuning.

Authors

  • Ioana Buhnila
  • Aman Sinha
  • Mathieu Constant

Paper Information

  • arXiv ID: 2511.21006v1
  • Categories: cs.CL
  • Published: November 26, 2025