[Paper] Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
Source: arXiv - 2512.08894v1
Overview
The authors revisit a long‑standing assumption in LLM research: that scaling laws derived from pre‑training loss are poor predictors of downstream task performance. By directly modeling how benchmark accuracy scales with the total training budget, they demonstrate that a simple power‑law relationship can reliably forecast downstream results across a range of model sizes and token counts. This finding reshapes how practitioners can plan compute budgets and anticipate real‑world performance without costly trial‑and‑error experiments.
Key Contributions
- Direct scaling law for downstream metrics – Shows that downstream accuracy on several popular benchmarks follows a clean power law in the training budget (tokens × parameters), i.e., log‑accuracy is linear in log‑budget.
- Empirical validation across scales – Experiments on models from 125 M to 17 B parameters trained on up to 350 B tokens, covering two distinct data mixtures.
- Comparison with two‑stage approach – Demonstrates that the direct method extrapolates more accurately than the traditional pipeline (pre‑training loss → downstream prediction).
- Extended functional forms – Introduces formulas that incorporate token‑to‑parameter ratios and inference compute (e.g., repeated sampling) to predict accuracy under different deployment scenarios.
- Open data release – Publishes the full set of pre‑training loss curves and downstream evaluation results, enabling reproducibility and further research.
Methodology
- Training budget as the independent variable – The authors treat the product of model parameters (P) and total training tokens (T) as a single “budget” variable, B = P × T.
- Power‑law fitting – For each downstream benchmark they fit a relation of the form log(accuracy) = a · log(B) + b, where a and b are learned coefficients (see the fitting sketch after this list).
- Cross‑validation across token‑to‑parameter ratios – They repeat the fitting for several fixed ratios r = T/P to verify that the law holds when the ratio changes.
- Inference‑compute extension – By modeling repeated sampling (e.g., temperature‑based decoding or ensemble voting) they add a term that captures extra inference FLOPs, yielding a more general prediction surface.
- Baseline comparison – The classic two‑stage pipeline first predicts pre‑training loss from the budget, then maps loss to downstream accuracy. The authors replicate this pipeline and compare extrapolation error against their direct method.
All steps rely on ordinary least‑squares regression; no exotic optimization or reinforcement learning tricks are required, making the approach easy to reproduce.
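As a concrete illustration of that fitting step, below is a minimal Python sketch, assuming synthetic (budget, accuracy) pairs and NumPy's ordinary least-squares polynomial fit; the numbers and the 10× extrapolation target are placeholders, not values from the paper.

```python
import numpy as np

# Illustrative (budget, accuracy) pairs, where budget B = parameters * tokens.
# These numbers are synthetic placeholders, not measurements from the paper.
budgets = np.array([1e20, 5e20, 2e21, 1e22, 5e22])     # B = P * T
accuracies = np.array([0.52, 0.58, 0.64, 0.71, 0.78])  # benchmark accuracy

# Fit log(accuracy) = a * log(B) + b with ordinary least squares.
a, b = np.polyfit(np.log(budgets), np.log(accuracies), deg=1)

def predict_accuracy(budget: float) -> float:
    """Predict benchmark accuracy for a given training budget B = P * T."""
    return float(np.exp(a * np.log(budget) + b))

print(f"slope a = {a:.4f}, intercept b = {b:.4f}")
# Extrapolate to a budget roughly 10x larger than the largest fitted point.
print(f"predicted accuracy at B = 5e23: {predict_accuracy(5e23):.3f}")
```

Because the fit lives in log–log space, the same two coefficients describe both interpolation between the measured budgets and extrapolation beyond them.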
Results & Findings
| Evaluation | Direct Power‑Law (this work) | Two‑Stage Baseline |
|---|---|---|
| Mean absolute error on held‑out downstream accuracy (across 5 benchmarks) | ≈ 1.2 % | ≈ 3.8 % |
| Extrapolation to 17 B‑parameter models (unseen during fitting) | Within 1 % of actual accuracy | Over‑estimates by 4–6 % |
| Sensitivity to token‑to‑parameter ratio | Captured by a simple additive term; predictions stay within 2 % across ratios 10–1000 | Errors grow >5 % when ratio deviates from training points |
Key takeaways
- Log‑accuracy scales linearly with log‑budget for the tasks examined (e.g., BoolQ, RTE, SST‑2).
- The direct model’s extrapolation error remains low even when predicting performance for models 10× larger than any training point (a rough held‑out check is sketched after this list).
- Incorporating inference compute yields a smooth trade‑off curve that matches empirical results from temperature‑scaled sampling and majority‑vote ensembles.
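To make the extrapolation claim above concrete, here is a rough held-out check in the same spirit: fit the power law on the smaller budgets only, predict the largest configuration, and report the absolute error in accuracy points. All values are illustrative placeholders, not the paper's measurements.

```python
import numpy as np

# Hypothetical held-out check: fit on the smaller budgets, predict the largest.
# All values are illustrative, not the paper's measurements.
budgets = np.array([1e20, 5e20, 2e21, 1e22, 5e22, 5e23])
accuracies = np.array([0.52, 0.58, 0.64, 0.71, 0.78, 0.90])

fit_B, fit_acc = budgets[:-1], accuracies[:-1]   # small/medium configurations
held_B, held_acc = budgets[-1], accuracies[-1]   # held-out large configuration

a, b = np.polyfit(np.log(fit_B), np.log(fit_acc), deg=1)
pred = float(np.exp(a * np.log(held_B) + b))

# Absolute error in accuracy points, the quantity summarized in the table above.
print(f"predicted {pred:.3f} vs. actual {held_acc:.3f} "
      f"(abs. error {abs(pred - held_acc) * 100:.1f} points)")
```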
Practical Implications
- Budget‑driven model selection – Teams can now estimate the downstream accuracy they’ll achieve for a given compute budget before committing to expensive training runs (a worked inversion sketch follows this list).
- Rapid prototyping – By fitting a few small‑scale experiments, developers can forecast the performance of much larger models, reducing iteration cycles.
- Cost‑effective inference planning – The extended formula helps decide whether to invest extra inference FLOPs (e.g., more sampling steps) versus scaling the model size.
- Dataset‑mix decisions – Since the authors test two data mixtures, the methodology can be reused to compare the downstream payoff of different pre‑training corpora without full‑scale runs.
- Product road‑mapping – Companies can align roadmap milestones (e.g., “reach 90 % accuracy on X benchmark by Q3”) with concrete compute allocations, improving transparency with stakeholders.
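Building on the budget‑driven model selection point above, a fitted law can also be inverted for planning: pick a target accuracy, solve for the required budget B, and split B into a parameter count and token count for a chosen token‑to‑parameter ratio. The coefficients, target, and ratio below are hypothetical placeholders, not values from the paper.

```python
import math

# Hypothetical coefficients from a previously fitted law
# log(accuracy) = a * log(B) + b; placeholder values, not from the paper.
a, b = 0.065, -3.65

def budget_for_target(target_acc: float) -> float:
    """Invert the fitted law: return the budget B = P * T expected to reach target_acc."""
    return math.exp((math.log(target_acc) - b) / a)

def split_budget(budget: float, ratio: float) -> tuple[float, float]:
    """Split B into (parameters P, tokens T) for a chosen ratio r = T / P,
    using B = P * T = r * P**2."""
    params = math.sqrt(budget / ratio)
    tokens = ratio * params
    return params, tokens

B = budget_for_target(0.85)
P, T = split_budget(B, ratio=20.0)  # e.g., a Chinchilla-style 20 tokens per parameter
print(f"Budget B ~ {B:.2e}; one option: P ~ {P:.2e} params, T ~ {T:.2e} tokens")
```

The same inversion can be rerun for several candidate ratios to compare deployment options (smaller model trained longer versus larger model trained on fewer tokens) at a fixed budget.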
Limitations & Future Work
- Task coverage – The study focuses on a handful of classification and reasoning benchmarks; scaling behavior for generation‑heavy tasks (e.g., code synthesis, long‑form QA) remains untested.
- Model architecture variance – All experiments use a standard decoder‑only transformer; it is unclear whether the same power‑law holds for encoder‑decoder or mixture‑of‑experts models.
- Data quality effects – While two dataset mixtures are examined, the impact of data cleaning, tokenization strategies, or domain‑specific corpora on the scaling law is not fully explored.
- Beyond power‑law – At extreme scales (hundreds of billions of parameters) the linear log‑log relationship may saturate; future work could investigate asymptotic regimes or incorporate saturation terms.
The authors invite the community to extend the dataset, test additional tasks, and refine the functional forms, paving the way for more reliable, budget‑aware LLM development.
Authors
- Jakub Krajewski
- Amitis Shidani
- Dan Busbridge
- Sam Wiseman
- Jason Ramapuram
Paper Information
- arXiv ID: 2512.08894v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08894v1