[Paper] Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
Source: arXiv - 2512.08894v1
Overview
The authors revisit a long‑standing assumption in LLM research: that scaling laws derived from pre‑training loss are poor predictors of downstream task performance. By directly modeling how benchmark accuracy scales with the total training budget, they demonstrate that a simple power‑law relationship can reliably forecast downstream results across a range of model sizes and token counts. This finding reshapes how practitioners can plan compute budgets and anticipate real‑world performance without costly trial‑and‑error experiments.
Key Contributions
- Direct scaling law for downstream metrics – Shows that downstream accuracy on several popular benchmarks follows a clean power law in the training budget (tokens × parameters), i.e., log‑accuracy is linear in log‑budget.
- Empirical validation across scales – Experiments on models from 125 M to 17 B parameters trained on up to 350 B tokens, covering two distinct data mixtures.
- Comparison with two‑stage approach – Demonstrates that the direct method extrapolates more accurately than the traditional pipeline (pre‑training loss → downstream prediction).
- Extended functional forms – Introduces formulas that incorporate token‑to‑parameter ratios and inference compute (e.g., repeated sampling) to predict accuracy under different deployment scenarios.
- Open data release – Publishes the full set of pre‑training loss curves and downstream evaluation results, enabling reproducibility and further research.
Methodology
- Training budget as the independent variable – The authors treat the product of model parameters (P) and total training tokens (T) as a single “budget” variable, B = P × T.
- Power‑law fitting – For each downstream benchmark they fit a relation of the form log(accuracy) = a · log(B) + b, where a and b are learned coefficients (see the fitting sketch after this list).
- Cross‑validation across token‑to‑parameter ratios – They repeat the fitting for several fixed ratios r = T/P to verify that the law holds when the ratio changes.
- Inference‑compute extension – By modeling repeated sampling (e.g., temperature‑based decoding or ensemble voting) they add a term that captures extra inference FLOPs, yielding a more general prediction surface.
- Baseline comparison – The classic two‑stage pipeline first predicts pre‑training loss from the budget, then maps loss to downstream accuracy. The authors replicate this pipeline and compare extrapolation error against their direct method.
All steps rely on ordinary least‑squares regression; no exotic optimization or reinforcement learning tricks are required, making the approach easy to reproduce.
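As a concrete illustration of that fitting step, below is a minimal Python sketch, assuming synthetic (budget, accuracy) pairs and NumPy's ordinary least-squares polynomial fit; the numbers and the 10× extrapolation target are placeholders, not values from the paper.

```python
import numpy as np

# Illustrative (budget, accuracy) pairs, where budget B = parameters * tokens.
# These numbers are synthetic placeholders, not measurements from the paper.
budgets = np.array([1e20, 5e20, 2e21, 1e22, 5e22])     # B = P * T
accuracies = np.array([0.52, 0.58, 0.64, 0.71, 0.78])  # benchmark accuracy

# Fit log(accuracy) = a * log(B) + b with ordinary least squares.
a, b = np.polyfit(np.log(budgets), np.log(accuracies), deg=1)

def predict_accuracy(budget: float) -> float:
    """Predict benchmark accuracy for a given training budget B = P * T."""
    return float(np.exp(a * np.log(budget) + b))

print(f"slope a = {a:.4f}, intercept b = {b:.4f}")
# Extrapolate to a budget roughly 10x larger than the largest fitted point.
print(f"predicted accuracy at B = 5e23: {predict_accuracy(5e23):.3f}")
```

Because the fit lives in log–log space, the same two coefficients describe both interpolation between the measured budgets and extrapolation beyond them.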
Results & Findings
| Evaluation | Direct Power‑Law (this work) | Two‑Stage Baseline |
|---|---|---|
| Mean absolute error on held‑out downstream accuracy (across 5 benchmarks) | ≈ 1.2 % | ≈ 3.8 % |
| Extrapolation to 17 B‑parameter models (unseen during fitting) | Within 1 % of actual accuracy | Over‑estimates by 4–6 % |
| Sensitivity to token‑to‑parameter ratio | Captured by a simple additive term; predictions stay within 2 % across ratios 10–1000 | Errors grow >5 % when ratio deviates from training points |
Key takeaways
- Log‑accuracy scales linearly with log‑budget for the tasks examined (e.g., BoolQ, RTE, SST‑2).
- The direct model’s extrapolation error remains low even when predicting performance for models 10× larger than any training point (a rough held‑out check is sketched after this list).
- Incorporating inference compute yields a smooth trade‑off curve that matches empirical results from temperature‑scaled sampling and majority‑vote ensembles.
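To make the extrapolation claim above concrete, here is a rough held-out check in the same spirit: fit the power law on the smaller budgets only, predict the largest configuration, and report the absolute error in accuracy points. All values are illustrative placeholders, not the paper's measurements.

```python
import numpy as np

# Hypothetical held-out check: fit on the smaller budgets, predict the largest.
# All values are illustrative, not the paper's measurements.
budgets = np.array([1e20, 5e20, 2e21, 1e22, 5e22, 5e23])
accuracies = np.array([0.52, 0.58, 0.64, 0.71, 0.78, 0.90])

fit_B, fit_acc = budgets[:-1], accuracies[:-1]   # small/medium configurations
held_B, held_acc = budgets[-1], accuracies[-1]   # held-out large configuration

a, b = np.polyfit(np.log(fit_B), np.log(fit_acc), deg=1)
pred = float(np.exp(a * np.log(held_B) + b))

# Absolute error in accuracy points, the quantity summarized in the table above.
print(f"predicted {pred:.3f} vs. actual {held_acc:.3f} "
      f"(abs. error {abs(pred - held_acc) * 100:.1f} points)")
```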
Practical Implications
- Budget‑driven model selection – Teams can now estimate the downstream accuracy they’ll achieve for a given compute budget before committing to expensive training runs (a worked inversion sketch follows this list).
- Rapid prototyping – By fitting a few small‑scale experiments, developers can forecast the performance of much larger models, reducing iteration cycles.
- Cost‑effective inference planning – The extended formula helps decide whether to invest extra inference FLOPs (e.g., more sampling steps) versus scaling the model size.
- Dataset‑mix decisions – Since the authors test two data mixtures, the methodology can be reused to compare the downstream payoff of different pre‑training corpora without full‑scale runs.
- Product road‑mapping – Companies can align roadmap milestones (e.g., “reach 90 % accuracy on X benchmark by Q3”) with concrete compute allocations, improving transparency with stakeholders.
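Building on the budget‑driven model selection point above, a fitted law can also be inverted for planning: pick a target accuracy, solve for the required budget B, and split B into a parameter count and token count for a chosen token‑to‑parameter ratio. The coefficients, target, and ratio below are hypothetical placeholders, not values from the paper.

```python
import math

# Hypothetical coefficients from a previously fitted law
# log(accuracy) = a * log(B) + b; placeholder values, not from the paper.
a, b = 0.065, -3.65

def budget_for_target(target_acc: float) -> float:
    """Invert the fitted law: return the budget B = P * T expected to reach target_acc."""
    return math.exp((math.log(target_acc) - b) / a)

def split_budget(budget: float, ratio: float) -> tuple[float, float]:
    """Split B into (parameters P, tokens T) for a chosen ratio r = T / P,
    using B = P * T = r * P**2."""
    params = math.sqrt(budget / ratio)
    tokens = ratio * params
    return params, tokens

B = budget_for_target(0.85)
P, T = split_budget(B, ratio=20.0)  # e.g., a Chinchilla-style 20 tokens per parameter
print(f"Budget B ~ {B:.2e}; one option: P ~ {P:.2e} params, T ~ {T:.2e} tokens")
```

The same inversion can be rerun for several candidate ratios to compare deployment options (smaller model trained longer versus larger model trained on fewer tokens) at a fixed budget.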
Limitations & Future Work
- Task coverage – The study focuses on a handful of classification and reasoning benchmarks; scaling behavior for generation‑heavy tasks (e.g., code synthesis, long‑form QA) remains untested.
- Model architecture variance – All experiments use a standard decoder‑only transformer; it is unclear whether the same power‑law holds for encoder‑decoder or mixture‑of‑experts models.
- Data quality effects – While two dataset mixtures are examined, the impact of data cleaning, tokenization strategies, or domain‑specific corpora on the scaling law is not fully explored.
- Beyond power‑law – At extreme scales (hundreds of billions of parameters) the linear log‑log relationship may saturate; future work could investigate asymptotic regimes or incorporate saturation terms.
The authors invite the community to extend the dataset, test additional tasks, and refine the functional forms, paving the way for more reliable, budget‑aware LLM development.
Authors
- Jakub Krajewski
- Amitis Shidani
- Dan Busbridge
- Sam Wiseman
- Jason Ramapuram
Paper Information
- arXiv ID: 2512.08894v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08894v1