[Paper] Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks
Source: arXiv - 2512.21316v1
Overview
Ali Merali’s paper quantifies how the training compute behind large language models (LLMs) translates into real‑world productivity for knowledge‑work roles such as consulting, data analysis, and management. In a large‑scale, preregistered experiment with more than 500 professionals, the study uncovers a clear “scaling law”: every additional year of AI model progress (roughly one doubling of training compute at recent growth rates) cuts task completion time by about 8 %, with just over half of the gain coming from raw compute growth and the remainder from algorithmic improvements.
Key Contributions
- Empirical scaling law for economic impact – Derives a simple, interpretable relationship between LLM training compute and professional productivity.
- Large‑scale field experiment – 500+ participants across three job families used 13 different LLMs (varying in size, architecture, and training budget).
- Decomposition of gains – Shows that ~56 % of productivity improvement stems from increased compute, while ~44 % comes from algorithmic innovations (e.g., prompting strategies, fine‑tuning).
- Task‑type differentiation – Demonstrates that “non‑agentic” analytical tasks (e.g., report drafting, data summarization) reap far larger speedups than “agentic” workflows that require tool integration or multi‑step reasoning.
- Macro‑level projection – Estimates that continued model scaling could lift U.S. overall productivity by ~20 % over the next decade if adoption spreads across knowledge‑intensive occupations.
Methodology
- Participant recruitment – 527 professionals (consultants, data analysts, managers) were recruited via industry partners and compensated for completing a set of realistic work‑day tasks.
- Task design – Each participant performed three representative tasks:
- Consulting: drafting a client recommendation memo.
- Data analysis: cleaning a CSV, generating descriptive statistics, and writing a brief insight summary.
- Management: creating a project‑status dashboard and writing a concise update email.
- LLM conditions – Participants were randomly assigned to one of 13 LLMs ranging from ~1 B to ~175 B parameters, covering both open‑source and commercial offerings. Training compute estimates (total FLOPs) for each model were taken from public documentation.
- Measurement – Task completion time was logged automatically; quality was assessed by blind expert reviewers using a rubric (clarity, correctness, relevance).
- Statistical analysis – A preregistered mixed‑effects regression modeled log task time as a function of log training compute, controlling for participant skill, task difficulty, and model family. The coefficient on log‑compute is the scaling exponent (≈ ‑0.08; with compute on a log₂ scale this corresponds to exp(‑0.08) ≈ 0.92, i.e., roughly an 8 % time reduction per doubling of compute); a minimal sketch of this specification appears after this list.
- Decomposition – By comparing models released in the same compute bracket but with newer architectures, the authors isolated algorithmic progress contributions.
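A minimal sketch of the regression specification described above, assuming one row per participant‑task, a random intercept per participant as the skill control, and hypothetical column names (`train_flops`, `task_minutes`, etc.). This is an illustration of the stated design, not the authors’ code:

```python
# Sketch of the preregistered specification: log task time regressed on
# log2 training compute with a random intercept per participant.
# File and column names below are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("task_logs.csv")            # one row per participant x task (hypothetical file)
df["log_time"] = np.log(df["task_minutes"])
df["log2_compute"] = np.log2(df["train_flops"])

# Fixed effects for compute, task type (difficulty), and model family;
# random intercepts for participants proxy for individual skill.
model = smf.mixedlm(
    "log_time ~ log2_compute + C(task_type) + C(model_family)",
    data=df,
    groups=df["participant_id"],
)
result = model.fit()
print(result.summary())

# The coefficient on log2_compute is the scaling exponent: a value of -0.08
# implies each doubling of compute multiplies task time by exp(-0.08) ≈ 0.92,
# i.e., roughly an 8% reduction.
beta = result.params["log2_compute"]
print(f"Time multiplier per compute doubling: {np.exp(beta):.3f}")
```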
Results & Findings
- Scaling exponent: Each doubling of training compute reduces average task time by ~8 % (p < 0.001); a worked form of this relationship appears after this list.
- Compute vs. algorithmic share: 56 % of the total speedup is attributable to larger compute; 44 % to smarter training tricks, prompting, and fine‑tuning.
- Task‑type variance:
- Analytical (non‑agentic) tasks saw up to 12 % time reduction per compute doubling.
- Agentic tasks (requiring tool calls, multi‑step planning) achieved only a ~4 % reduction per doubling, suggesting diminishing returns when external tool orchestration is needed.
- Quality trade‑off: Expert quality ratings were statistically indistinguishable across models, indicating that speed gains did not come at the expense of accuracy.
- Productivity projection: Assuming a 2× compute growth per year (consistent with recent trends) and steady adoption, the model predicts a cumulative ~20 % boost in U.S. knowledge‑worker productivity by 2035.
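A worked form of the implied relationship, reconstructed from the numbers reported above; the exact functional form (log task time linear in log₂ compute) is an assumption for illustration, not a formula quoted from the paper:

```latex
% Assumed functional form implied by the regression described above:
\[
  \ln T(C) = \alpha + \beta \log_2 C, \qquad \beta \approx -0.08
\]
% Each doubling of compute multiplies task time by
\[
  \frac{T(2C)}{T(C)} = e^{\beta} \approx e^{-0.08} \approx 0.92
  \quad\Rightarrow\quad \text{roughly an 8\% time reduction per doubling.}
\]
% A 4x compute increase corresponds to two doublings:
\[
  \frac{T(4C)}{T(C)} = e^{2\beta} \approx e^{-0.16} \approx 0.85
  \quad\Rightarrow\quad \text{about a 15\% time saving.}
\]
```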
Practical Implications
- Tool selection: Companies can prioritize larger, compute‑heavy LLMs for tasks like report generation, data summarization, and internal documentation, where the ROI is highest.
- Workflow redesign: For agentic processes (e.g., automated spreadsheet manipulation, code generation), developers should invest in better orchestration layers (RAG pipelines, tool‑calling APIs) rather than relying solely on raw model size.
- Cost‑benefit modeling: The scaling law provides a quantitative basis for trading off AI compute against engineering effort. For example, at ~8 % per doubling, a 4× compute investment implies roughly a 15 % time saving, which can be translated into labor cost reductions (see the budgeting sketch after this list).
- Talent strategy: Upskilling staff in prompt engineering and model‑selection can capture a sizable portion (~44 %) of the productivity gains without additional hardware spend.
- Policy & investment: The macro‑level projection supports arguments for public and private funding in compute infrastructure, as the downstream economic impact could be comparable to traditional productivity‑enhancing technologies (e.g., broadband, ERP systems).
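A small budgeting sketch of that cost‑benefit calculation, using the ~8 %‑per‑doubling exponent reported above. The hours and hourly rate are made‑up placeholders, not figures from the paper:

```python
# Translate a compute multiplier into an expected time saving and an
# illustrative labor-cost reduction, using the paper's reported exponent.
import math

def time_saving(compute_multiplier: float, beta: float = -0.08) -> float:
    """Fraction of task time saved when training compute grows by `compute_multiplier`."""
    doublings = math.log2(compute_multiplier)
    return 1.0 - math.exp(beta * doublings)

saving = time_saving(4.0)              # two doublings -> ~15% time saved
hours_on_llm_tasks = 800               # hypothetical annual hours per analyst
hourly_cost = 90.0                     # hypothetical fully loaded rate (USD)
print(f"Time saved: {saving:.1%}")
print(f"Annual labor-cost reduction per analyst: ${saving * hours_on_llm_tasks * hourly_cost:,.0f}")
```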
Limitations & Future Work
- Sample bias: Participants were self‑selected professionals comfortable with AI tools; results may overstate gains for less tech‑savvy workers.
- Task scope: The study focused on short, well‑defined tasks; longer‑horizon projects (e.g., strategic planning) could exhibit different scaling dynamics.
- Compute measurement granularity: Public FLOP estimates can be noisy; more precise accounting (including inference compute) would refine the scaling exponent.
- Tool integration: The modest gains for agentic tasks highlight a need for research on better tool‑calling frameworks and multimodal prompting.
- Longitudinal effects: Future work should track how productivity evolves as workers become more proficient with LLMs and as models continue to improve beyond the current compute frontier.
Authors
- Ali Merali
Paper Information
- arXiv ID: 2512.21316v1
- Categories: econ.GN, cs.AI, cs.HC
- Published: December 24, 2025