[Paper] On the Origin of Algorithmic Progress in AI

Published: November 26, 2025 at 12:46 PM EST
4 min read

Source: arXiv - 2511.21622v1

Overview

The paper investigates why AI training has become dramatically more compute‑efficient over the past decade. By dissecting a suite of historic algorithmic innovations and running large‑scale scaling experiments, the authors show that most of the observed 22,000× boost in FLOP efficiency cannot be explained by “static” algorithmic improvements alone. Instead, the bulk of the gains stem from scale‑dependent efficiency—most notably the shift from LSTMs to Transformers—which fundamentally changes how compute translates into model performance as model size grows.

Key Contributions

  • Quantitative audit of historic algorithmic gains: Small‑scale ablations of well‑known innovations (e.g., residual connections, layer normalization) explain < 10× of the total efficiency increase.
  • Literature‑wide estimate of missing gains: Survey of additional papers suggests another < 10× contribution, still far short of the reported 22,000×.
  • Scaling‑law experiments: Direct comparison of LSTM and Transformer families across many compute budgets reveals distinct compute‑optimal scaling exponents.
  • Scale‑dependent efficiency model: Demonstrates that algorithmic progress is not a fixed multiplier but varies with model size, accounting for ~ 6,900× of the total gain.
  • Reinterpretation of “algorithmic progress”: Argues that efficiency metrics are heavily reference‑dependent and that small‑model improvements have been modest.

Methodology

  1. Ablation Benchmarks: The authors re‑implemented a set of canonical architectural tweaks (e.g., attention mechanisms, normalization layers) and measured their FLOP‑to‑accuracy trade‑offs on standard NLP/vision tasks.
  2. Literature Survey: They collected reported efficiency gains from 2012‑2023 papers, extracting rough multiplicative improvements for each innovation not covered by the ablations.
  3. Scaling Experiments: Using identical training pipelines, they trained families of LSTM and Transformer models across a wide range of compute budgets (from 10⁹ to 10¹⁴ FLOPs). For each family they fit the empirical compute‑optimal scaling law Performance ∝ Compute^α and compared the fitted exponents α_LSTM vs. α_Transformer (a fitting sketch follows this list).
  4. Extrapolation: By integrating the measured scaling exponents with the historical growth of compute budgets, they estimated the cumulative efficiency gain attributable to the LSTM→Transformer transition.
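
To make the fitting step concrete, here is a minimal sketch that estimates a compute‑optimal exponent per family from (compute, performance) pairs via a log‑log least‑squares fit. The budgets, constants, and function names are illustrative assumptions; this is not the authors' released code or their measured data.

```python
# Minimal sketch of the fitting step: estimate a compute-optimal scaling
# exponent per model family by fitting Performance ≈ k * Compute^alpha
# in log-log space. All data and names are illustrative, not the paper's.
import numpy as np

def fit_scaling_exponent(compute_flops, performance):
    """Return (alpha, k) from a least-squares fit of log(perf) vs. log(compute)."""
    log_c = np.log(np.asarray(compute_flops, dtype=float))
    log_p = np.log(np.asarray(performance, dtype=float))
    alpha, log_k = np.polyfit(log_c, log_p, 1)  # slope = alpha, intercept = log k
    return alpha, np.exp(log_k)

# Hypothetical budgets (FLOPs) and synthetic "performance" values generated
# with the rounded exponents reported in the paper.
budgets = np.array([1e9, 1e10, 1e11, 1e12, 1e13, 1e14])
perf_lstm = 1e-3 * budgets ** 0.45
perf_transformer = 1e-4 * budgets ** 0.65

alpha_lstm, _ = fit_scaling_exponent(budgets, perf_lstm)
alpha_transformer, _ = fit_scaling_exponent(budgets, perf_transformer)
print(f"α_LSTM ≈ {alpha_lstm:.2f}, α_Transformer ≈ {alpha_transformer:.2f}")
```

In practice the fitted points would come from compute‑optimally trained models at each budget, and the extrapolation in step 4 builds directly on these fitted exponents.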

All experiments were run on publicly available hardware (GPU clusters), and the code is released for reproducibility.

Results & Findings

  • Static algorithmic gains: < 10× from the ablations; < 10× from the literature survey; < 100× in total.
  • Scaling exponent difference: LSTMs show α ≈ 0.45, Transformers α ≈ 0.65; the higher exponent means Transformers extract more performance per additional FLOP as models get larger.
  • Cumulative efficiency: Accounting for the exponential growth of compute budgets (≈ 10⁴× from 2012 to 2023), the scale‑dependent advantage of Transformers translates into an overall FLOP‑efficiency gain of ≈ 6,930× (see the sketch after this list).
  • Dominant source of progress: The LSTM‑to‑Transformer transition alone explains the majority (> 90%) of the observed efficiency improvement.
  • Other innovations: Most other architectural tweaks (e.g., residual connections, attention variants) show near‑identical scaling exponents and contribute only marginally to long‑term efficiency.
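
The cumulative‑efficiency estimate reflects scale‑dependent gains: when two model families follow power laws with different exponents, the compute‑equivalent advantage of the steeper family grows with the training budget instead of being a fixed multiplier. The back‑of‑the‑envelope sketch below illustrates that mechanism only; the constant factors are assumptions and the printed numbers are not the paper's ≈ 6,930× accounting.

```python
# Sketch of scale-dependent efficiency: with Performance = k * Compute^alpha,
# the compute the weaker family needs to match the stronger one grows with budget.
# The constant factors k_new/k_old are illustrative assumptions, and the outputs
# are not the paper's ~6,930x estimate.
def compute_equivalent_gain(compute, alpha_new, alpha_old, k_new=1.0, k_old=1.0):
    """Compute-equivalent multiplier: FLOPs the old family would need to match
    the new family's performance at `compute`, divided by `compute` itself."""
    compute_old = (k_new / k_old) ** (1.0 / alpha_old) * compute ** (alpha_new / alpha_old)
    return compute_old / compute

for budget in (1e18, 1e20, 1e22):  # hypothetical training budgets in FLOPs
    gain = compute_equivalent_gain(budget, alpha_new=0.65, alpha_old=0.45)
    print(f"{budget:.0e} FLOPs -> compute-equivalent gain ≈ {gain:.1e}×")
```

The qualitative takeaway is that the multiplier keeps growing with compute, which is why efficiency claims only make sense relative to a stated reference scale.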

Practical Implications

  • Model selection for budget‑constrained projects: When planning to train large models, the scaling exponent matters more than raw architectural tweaks. Choosing a Transformer‑based family can yield far better returns on compute investment than iterating on LSTM‑style designs.
  • Hardware‑aware roadmap planning: Companies that forecast compute budgets (e.g., for next‑gen GPUs/TPUs) should factor in the scale‑dependent nature of algorithmic progress; under a compute‑optimal scaling law, a 2× increase in compute yields roughly a 2^α improvement in the performance measure, so families with higher exponents convert hardware growth into substantially larger gains (see the sketch after this list).
  • Benchmarking standards: Current “FLOP‑efficiency” benchmarks that treat algorithms as static multipliers may mislead developers. Reporting performance as a function of compute (scaling curves) provides a more actionable metric.
  • Research focus: Efforts that aim to improve small‑model efficiency (e.g., pruning, quantization) may have limited impact on the overall trajectory of AI progress unless they also shift the scaling exponent.
  • Tooling and AutoML: AutoML pipelines that search over model families should incorporate scaling‑law predictions to prioritize families with steeper exponents for large‑scale deployments.
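
As a concrete reading of the hardware‑roadmap and benchmarking points above, the short sketch below converts a fixed compute multiplier into a performance multiplier under Performance ∝ Compute^α, using the rounded exponents reported earlier; how this maps onto any particular downstream metric is an assumption that varies by task.

```python
# Sketch: translate a fixed compute multiplier into a performance multiplier
# under Performance ∝ Compute^alpha. The exponents are the rounded values
# reported above; the mapping to downstream task metrics is task-dependent.
for name, alpha in (("LSTM-like (α ≈ 0.45)", 0.45), ("Transformer-like (α ≈ 0.65)", 0.65)):
    for mult in (2, 10):
        print(f"{name}: {mult}× compute -> {mult ** alpha:.2f}× performance")
```

Plotting the full curve of performance against compute, rather than quoting a single multiplier, is what makes these differences visible, which is the benchmarking recommendation above.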

Limitations & Future Work

  • Task diversity: The scaling experiments focus primarily on language modeling and a few vision benchmarks; other domains (reinforcement learning, speech) may exhibit different exponent dynamics.
  • Hardware heterogeneity: All experiments were run on GPUs; scaling behavior could vary on specialized ASICs or future architectures.
  • Long‑tail innovations: The paper acknowledges that many niche algorithmic ideas (e.g., sparsity, mixture‑of‑experts) were not fully explored and could affect scaling at extreme compute levels.
  • Extrapolation risk: Predicting efficiency gains far beyond observed compute budgets assumes the scaling law remains stable, which may break down with new paradigms (e.g., neuromorphic computing).

Future work could extend the scaling‑law analysis to a broader set of model families, incorporate hardware‑specific factors, and explore whether novel algorithmic directions can increase the scaling exponent rather than just shift the constant factor.

Authors

  • Hans Gundlach
  • Alex Fogelson
  • Jayson Lynch
  • Ana Trisovic
  • Jonathan Rosenfeld
  • Anmol Sandhu
  • Neil Thompson

Paper Information

  • arXiv ID: 2511.21622v1
  • Categories: cs.LG, cs.AI
  • Published: November 26, 2025