[Paper] When Reasoning Meets Its Laws

Published: December 19, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.17901v1

Overview

Large Reasoning Models (LRMs) have pushed the frontier of AI‑driven problem solving, but their internal “thinking” often behaves in puzzling ways that hurt performance. This paper introduces the Laws of Reasoning (LoRe)—a formal framework that captures how a model’s compute and accuracy should scale with the difficulty of a question. By turning these abstract laws into measurable properties, the authors build a benchmark (LoRe‑Bench) and show that enforcing the laws during fine‑tuning leads to noticeably better reasoning across a suite of tasks.

Key Contributions

  • LoRe framework: Formalizes two core “laws” for reasoning models—
    1. Compute Law – required compute should grow linearly with question complexity.
    2. Accuracy Law – accuracy should improve monotonically as the model allocates more compute.
  • Two tractable properties:
    • Monotonicity – performance should never degrade when the problem gets easier.
    • Compositionality – solving a complex problem should be achievable by composing solutions to its sub‑problems, with compute scaling additively.
  • LoRe‑Bench: A systematic benchmark that isolates and measures monotonicity and compositionality for a variety of LRMs (GPT‑4, Claude, Llama‑2, etc.).
  • Fine‑tuning recipe: Introduces a lightweight training objective that explicitly penalizes violations of compositional compute allocation, encouraging models to spend compute in a linear, additive fashion.
  • Empirical validation: Demonstrates that models with higher LoRe compliance consistently outperform baselines on standard reasoning suites (e.g., GSM‑8K, MATH, BIG‑Bench Hard).

Methodology

  1. Defining question complexity – The authors approximate complexity using two proxies:

    • (a) the number of reasoning steps required (derived from chain‑of‑thought annotations), and
    • (b) the depth of logical nesting in the prompt.
  2. Measuring compute – Compute is quantified as the token‑level FLOPs the model spends, approximated as the number of generated tokens × the model's parameter count.

  3. Testing monotonicity – For each model, they construct paired questions where one is a simplified version of the other. The model’s accuracy on the easier version should be ≥ that on the harder one.

  4. Testing compositionality – Complex questions are decomposed into a sequence of sub‑questions. The compute spent on the sub‑questions, summed, is compared to the compute used when the model tackles the whole question directly; under the compute law, the two should roughly match (additive scaling).

  5. Fine‑tuning with LoRe loss – A regularization term is added to the standard cross‑entropy loss:

    \[ \mathcal{L}_{\text{LoRe}} = \lambda_{\text{mono}} \cdot \text{ReLU}\big( \text{Acc}_{\text{hard}} - \text{Acc}_{\text{easy}} \big) + \lambda_{\text{comp}} \cdot \text{ReLU}\big( \text{Compute}_{\text{whole}} - \sum \text{Compute}_{\text{sub}} \big) \]

    where the ReLU penalties fire only when a law is violated (a minimal Python sketch of this penalty appears after this list).

  6. Evaluation – Models are assessed before and after LoRe‑guided fine‑tuning on LoRe‑Bench and on downstream reasoning benchmarks.
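
The penalty from step 5 is simple enough to prototype. Below is a minimal, hedged Python sketch that combines the paper's compute proxy (step 2) with the two ReLU penalties; the function and variable names (compute_proxy, lore_penalty, lambda_mono, lambda_comp) are illustrative choices rather than the authors' code, and in practice the terms would be estimated over batches of paired and decomposed questions rather than single examples.

```python
# Minimal sketch of the LoRe regularization term defined above.
# Names and the standalone-function form are illustrative; the paper adds
# this penalty to the standard cross-entropy training loss.

def relu(x: float) -> float:
    # Hinge: the penalty is non-zero only when a law is violated.
    return max(0.0, x)

def compute_proxy(generated_tokens: int, n_params: float) -> float:
    # Paper's compute measure: token-level FLOPs ~ generated tokens x model size.
    return generated_tokens * n_params

def lore_penalty(acc_easy: float, acc_hard: float,
                 compute_whole: float, compute_subs: list[float],
                 lambda_mono: float = 1.0, lambda_comp: float = 1.0) -> float:
    # Monotonicity term: fires if the model does *better* on the harder variant
    # than on its easier counterpart (accuracy must not drop as questions get easier).
    mono_violation = relu(acc_hard - acc_easy)
    # Compositionality term: fires if answering the whole question costs more
    # compute than the sum over its sub-questions (compute should add up linearly).
    comp_violation = relu(compute_whole - sum(compute_subs))
    return lambda_mono * mono_violation + lambda_comp * comp_violation

# Example: monotonicity holds, but the whole problem wastes compute vs. its parts.
whole = compute_proxy(900, 70e9)
subs = [compute_proxy(300, 70e9), compute_proxy(350, 70e9)]
print(lore_penalty(acc_easy=0.92, acc_hard=0.81,
                   compute_whole=whole, compute_subs=subs,
                   lambda_comp=1e-13))  # lambda_comp rescales FLOP-sized violations
```

Because the compositionality violation is measured in FLOP-scale units while the monotonicity violation lives in [0, 1], the two weights have to be scaled (or the compute terms normalized) so the penalties are comparable in magnitude.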

Results & Findings

| Model (pre‑fine‑tune) | Monotonicity ✓/✗ | Compositionality ✓/✗ | Avg. Reasoning Score* |
| --- | --- | --- | --- |
| GPT‑4‑base |  |  | 71.4 |
| Claude‑2 |  |  | 68.9 |
| Llama‑2‑70B |  |  | 63.2 |
| After LoRe fine‑tuning |  |  | +5.8 % (average across models) |

*Scores are normalized averages of GSM‑8K, MATH, and BIG‑Bench Hard.

  • Monotonicity: All tested LRMs already obeyed the monotonicity property to a large extent, confirming that they rarely get worse on easier questions.
  • Compositionality: Most models failed the compositionality test; they spent disproportionately more compute on the whole problem than the sum of its parts, indicating inefficient reasoning pipelines.
  • Fine‑tuning impact: Enforcing compositionality closed the gap—models reduced compute waste by ~12 % and saw consistent accuracy gains (3–8 % absolute) across benchmarks.
  • Synergy: Improvements in compositionality also nudged monotonicity higher, suggesting the two laws reinforce each other.

Practical Implications

  • More predictable resource budgeting – By aligning compute with question complexity, developers can better estimate inference costs for on‑demand reasoning services such as AI‑assisted debugging or code synthesis (see the budgeting sketch after this list).
  • Improved chain‑of‑thought prompting – LoRe‑compliant models naturally decompose problems, making them more amenable to step‑by‑step prompting strategies without extra engineering.
  • Fine‑tuning recipe for production – The LoRe loss is lightweight (adds < 5 % overhead) and can be integrated into existing RLHF pipelines, offering a plug‑and‑play way to boost reasoning without massive data collection.
  • Benchmarking tool – LoRe‑Bench provides a quick sanity check for any new reasoning model before release, helping teams catch compositional inefficiencies early.
  • Potential for edge deployment – Linear compute scaling means that smaller devices can allocate just enough inference budget for a given problem, opening doors for on‑device reasoning assistants.
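
As a concrete illustration of the budgeting point above, the following back-of-the-envelope sketch shows how a linear compute law makes inference budgets predictable. The coefficients (BASE_TOKENS, TOKENS_PER_STEP) are hypothetical and would need to be fitted per model, and the FLOP figure uses the paper's tokens × model-size proxy rather than an exact hardware count.

```python
# Illustrative only: a rough inference budget under a linear compute law.
# BASE_TOKENS and TOKENS_PER_STEP are made-up coefficients for the sketch.

BASE_TOKENS = 64        # fixed overhead (reading the prompt, emitting the answer)
TOKENS_PER_STEP = 40    # average chain-of-thought tokens per reasoning step

def token_budget(estimated_steps: int) -> int:
    """Compute law: budget grows linearly with question complexity,
    here proxied by the estimated number of reasoning steps."""
    return BASE_TOKENS + TOKENS_PER_STEP * estimated_steps

def flops_budget(estimated_steps: int, n_params: float) -> float:
    """Paper's compute proxy: generated tokens x model size."""
    return token_budget(estimated_steps) * n_params

# A 3-step question on a 70B-parameter model:
print(token_budget(3))                 # 184 tokens
print(f"{flops_budget(3, 70e9):.2e}")  # ~1.29e13 (proxy units)
```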

Limitations & Future Work

  • Complexity proxy: The current step‑count and nesting‑depth proxies are heuristic; they may not capture all nuances of “hardness” for domains like visual reasoning or multi‑modal tasks.
  • Model size dependence: The study focused on models ≥ 13 B parameters; it remains unclear how LoRe behaves for tiny (≤ 1 B) models that are often used in latency‑critical settings.
  • Generalization to non‑text modalities: Extending LoRe to vision‑language or reinforcement‑learning agents will require redefining compute and complexity in those contexts.
  • Long‑term compositionality: The benchmark tests single‑level decomposition; future work could explore deeper hierarchical reasoning chains and their impact on compute scaling.

Overall, the paper offers a concrete, theory‑backed pathway to make large reasoning models more efficient and reliable—an advance that developers can start leveraging today.

Authors

  • Junyu Zhang
  • Yifan Sun
  • Tianang Leng
  • Jingyan Shen
  • Liu Ziyin
  • Paul Pu Liang
  • Huan Zhang

Paper Information

  • arXiv ID: 2512.17901v1
  • Categories: cs.AI, cs.CL
  • Published: December 19, 2025
  • PDF: Download PDF