[Paper] Do Generalisation Results Generalise?
Source: arXiv - 2512.07832v1
Overview
The paper asks a deceptively simple question: Do the out‑of‑distribution (OOD) generalisation results we report for large language models (LLMs) actually hold up across different OOD scenarios? By probing several OOD test sets during a single fine‑tuning run, the authors show that the correlation between a model’s performance on one OOD benchmark and another is far from consistent. In short, a model that looks great on one “hard” dataset may not be reliably robust elsewhere.
Key Contributions
- Multi‑benchmark OOD evaluation: Introduces a systematic protocol that measures a model’s performance on multiple OOD test sets throughout a fine‑tuning trajectory, rather than a single snapshot.
- Partial‑correlation analysis: Computes the correlation of OOD performances while controlling for in‑domain (ID) performance, isolating pure generalisation behaviour.
- Empirical findings on two open‑weight LLM families (OLMo2 and OPT): Demonstrates that the sign and magnitude of OOD‑to‑OOD correlations vary dramatically across model sizes, training regimes, and dataset choices.
- Critical insight for benchmarking: Highlights that a single OOD benchmark cannot be taken as a universal proxy for robustness, urging the community to adopt broader evaluation suites.
Methodology
- Model selection & fine‑tuning: The authors pick two popular LLM families, OLMo2 and OPT, across several sizes. Each model is fine‑tuned on a standard in‑domain task (e.g., language modelling or text classification).
- Checkpoint sampling: During fine‑tuning, they save model checkpoints at regular intervals (e.g., every few hundred steps). This yields a trajectory of models ranging from under‑trained to over‑trained.
- Multiple OOD test sets: For each checkpoint, they evaluate on a collection of OOD benchmarks that differ in domain shift type (topic shift, style shift, adversarial perturbations, etc.).
- Partial correlation computation: They calculate Pearson correlations between pairs of OOD test‑set scores, partialling out the in‑domain performance. This removes the confounding effect that a model simply getting better overall would improve all scores (a minimal code sketch follows this list).
- Statistical analysis: Significance testing and visualisation (scatter plots, heatmaps) are used to interpret the correlation patterns across model families and sizes.
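To make the partial‑correlation step concrete, here is a minimal sketch in Python with numpy/scipy (not the authors' code) of how an OOD‑to‑OOD correlation can be computed while controlling for ID performance. The score arrays (`id_scores`, `ood_a`, `ood_b`) are hypothetical stand‑ins for one checkpoint trajectory.

```python
import numpy as np
from scipy import stats

def partial_correlation(x, y, z):
    """Pearson correlation between x and y after regressing out z.

    x, y: scores of one checkpoint trajectory on two OOD benchmarks.
    z:    in-domain (ID) scores of the same checkpoints (the confounder).
    """
    def residuals(a, b):
        # Residualise a against b with ordinary least squares.
        design = np.column_stack([np.ones_like(b), b])
        coef, *_ = np.linalg.lstsq(design, a, rcond=None)
        return a - design @ coef

    return stats.pearsonr(residuals(x, z), residuals(y, z))  # (r, p-value)

# Hypothetical scores for 10 checkpoints of a single fine-tuning run.
rng = np.random.default_rng(0)
id_scores = np.linspace(0.5, 0.9, 10)               # in-domain accuracy
ood_a = id_scores + 0.05 * rng.standard_normal(10)  # OOD benchmark A
ood_b = id_scores - 0.05 * rng.standard_normal(10)  # OOD benchmark B

r, p = partial_correlation(ood_a, ood_b, id_scores)
print(f"partial r = {r:.2f} (p = {p:.3f})")
```

Regressing out the ID scores removes the shared "everything improves with training" trend; what remains indicates whether gains on one OOD benchmark actually track gains on the other.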
Results & Findings
- No universal OOD correlation: For many pairs of OOD test sets, the correlation across checkpoints is positive, but for others it is negative or near zero. The direction flips depending on the model family (OLMo2 vs. OPT) and even on model scale.
- In‑domain performance dominates raw OOD scores: When not controlling for ID performance, OOD scores appear highly correlated (as expected). The partial‑correlation step reveals that the apparent robustness is often just a by‑product of overall improvement.
- Model‑specific “robustness fingerprints”: Each model exhibits a distinct pattern of which OOD shifts it handles well together. For example, a large OPT checkpoint may simultaneously excel on topic‑shift and adversarial benchmarks, while a small OLMo2 checkpoint shows a trade‑off between them.
- Fine‑tuning dynamics matter: Early checkpoints sometimes show stronger OOD‑to‑OOD alignment than later ones, suggesting that over‑fitting to the ID data can decouple robustness across shifts.
Practical Implications
- Broader evaluation pipelines: Teams deploying LLMs should benchmark against multiple OOD datasets rather than relying on a single “hard” test set. This reduces the risk of hidden brittleness in production.
- Model selection & checkpointing: The findings encourage monitoring OOD performance during fine‑tuning. Selecting a checkpoint that balances ID accuracy with consistent OOD behaviour may be more valuable than chasing the highest ID score (a hypothetical selection heuristic is sketched after this list).
- Robustness‑aware fine‑tuning strategies: Techniques such as multi‑task fine‑tuning, data augmentation, or regularisation could be tuned to improve the alignment of OOD behaviours, not just overall performance.
- Benchmark design: Researchers and dataset curators should aim for diverse OOD suites that capture orthogonal shift types (topic, style, noise, adversarial) to surface the nuanced robustness profiles highlighted in the paper.
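One way to act on the checkpoint‑monitoring suggestion above is a simple selection rule: keep checkpoints whose ID accuracy is close to the best seen, then prefer the one with high and consistent OOD scores. The sketch below is a hypothetical heuristic, not something prescribed in the paper; the tolerance and the mean‑minus‑spread criterion are illustrative choices.

```python
import numpy as np

def select_checkpoint(id_scores, ood_scores, id_tolerance=0.02):
    """Pick a checkpoint that trades a little ID accuracy for consistent OOD behaviour.

    id_scores:  shape (n_checkpoints,), in-domain accuracy per checkpoint.
    ood_scores: shape (n_checkpoints, n_benchmarks), OOD accuracy per checkpoint.
    """
    id_scores = np.asarray(id_scores)
    ood_scores = np.asarray(ood_scores)

    # Keep checkpoints whose ID accuracy is within `id_tolerance` of the best one.
    candidates = np.flatnonzero(id_scores >= id_scores.max() - id_tolerance)

    # Among those, prefer high average OOD accuracy with low spread across benchmarks.
    mean_ood = ood_scores[candidates].mean(axis=1)
    spread = ood_scores[candidates].std(axis=1)
    return candidates[np.argmax(mean_ood - spread)]

# Hypothetical trajectory: 5 checkpoints, 3 OOD benchmarks.
id_acc = [0.78, 0.84, 0.87, 0.88, 0.88]
ood_acc = [[0.60, 0.58, 0.61],
           [0.66, 0.63, 0.65],
           [0.70, 0.55, 0.68],
           [0.69, 0.52, 0.71],
           [0.68, 0.50, 0.72]]
print(select_checkpoint(id_acc, ood_acc))  # index of the selected checkpoint
```

In practice, the ID tolerance and the weighting between average OOD accuracy and spread would need to be tuned for the deployment setting.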
Limitations & Future Work
- Scope of models: The study focuses on OLMo2 and OPT; extending the analysis to other architectures (e.g., LLaMA, GPT‑4) could reveal different patterns.
- Limited OOD domains: While the authors use several benchmarks, the space of possible distribution shifts is vast (multilingual, multimodal, code, etc.). More varied OOD sets would strengthen the conclusions.
- Partial correlation assumptions: Pearson’s linear correlation may miss non‑linear relationships between OOD performances. Future work could explore rank‑based or information‑theoretic measures (a rank‑based variant is sketched after this list).
- Intervention studies: The paper is observational; experiments that deliberately manipulate training data or regularisation to shape OOD‑to‑OOD correlations would provide actionable guidance for practitioners.
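As a pointer to the rank‑based alternative mentioned above, a Spearman‑style variant can be obtained by rank‑transforming the scores before the same residualisation step used in the earlier sketch. This is illustrative only and not part of the paper.

```python
import numpy as np
from scipy import stats

def rank_partial_correlation(x, y, z):
    """Partial correlation on rank-transformed scores (Spearman-style).

    Captures monotone rather than strictly linear relationships between the
    two OOD trajectories, while still controlling for the ID scores z.
    """
    rx, ry, rz = (stats.rankdata(np.asarray(v, dtype=float)) for v in (x, y, z))

    def residuals(a, b):
        # Residualise a against b with ordinary least squares.
        design = np.column_stack([np.ones_like(b), b])
        coef, *_ = np.linalg.lstsq(design, a, rcond=None)
        return a - design @ coef

    return stats.pearsonr(residuals(rx, rz), residuals(ry, rz))
```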
Authors
- Matteo Boglioni
- Andrea Sgobbi
- Gabriel Tavernini
- Francesco Rita
- Marius Mosbach
- Tiago Pimentel
Paper Information
- arXiv ID: 2512.07832v1
- Categories: cs.CL, cs.LG
- Published: December 8, 2025
- PDF: https://arxiv.org/pdf/2512.07832v1