[Paper] The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
Source: arXiv - 2601.00747v1
Overview
The paper *The Reasoning‑Creativity Trade‑off: Toward Creativity‑Driven Problem Solving* examines why modern large‑language‑model (LLM) pipelines that repeatedly “sample‑think‑refine” tend to lose creative spark as they chase correctness. By framing reasoning as a probability distribution over solution traces, the authors propose a unified variational objective, Distributional Creative Reasoning (DCR), that simultaneously preserves answer quality and semantic diversity.
Key Contributions
- Unified theoretical framework (DCR): Shows that popular methods (STaR, GRPO, DPO, entropy bonuses, etc.) are special cases of a single variational loss over reasoning‑path distributions.
- Diversity Decay Theorem: Formal proof that correctness‑centric objectives inevitably contract the entropy of reasoning paths, with distinct decay patterns for STaR, GRPO, and DPO (an illustrative caricature follows this list).
- Stability‑Diversity design recipe: Practical algorithmic tweaks (e.g., entropy‑regularized gradient flow, adaptive temperature scaling) that guarantee convergence to a policy that is both accurate and diverse.
- Empirical validation: Benchmarks on creative reasoning tasks (puzzle solving, open‑ended code generation, story continuation) demonstrate that DCR‑enhanced models retain higher semantic entropy while matching or exceeding baseline accuracy.
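To build intuition for the Diversity Decay Theorem, here is a hedged caricature in our own notation, not the paper's actual proof: a purely correctness‑driven update that exponentially reweights traces by reward eventually piles all probability mass onto the top‑scoring traces.

```latex
% Caricature of diversity decay (our notation, not the paper's theorem).
% A reward-only multiplicative update over a finite trace set,
%   \pi_{t+1}(\tau) \propto \pi_t(\tau)\, e^{\eta\, r(\tau)},
% composes to \pi_t(\tau) \propto \pi_0(\tau)\, e^{t \eta\, r(\tau)},
% so mass concentrates on the reward maximizers and
\[
  \lim_{t \to \infty} H(\pi_t) \;\le\; \log \bigl|\arg\max_{\tau} r(\tau)\bigr| ,
\]
% i.e. entropy contracts toward its floor; the paper's theorem supplies
% method-specific decay rates for STaR, GRPO, and DPO.
```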
Methodology
- Trace‑level modeling: Each reasoning episode is represented as a trace: the ordered sequence of intermediate tokens or “thought steps.” The model’s policy defines a probability measure over all possible traces (a scoring sketch follows this list).
- Variational objective: DCR minimizes a KL‑type divergence between the model’s trace distribution and a target distribution that balances two forces (one concrete instantiation is written out after this list):
  - Correctness pressure: rewarding high‑scoring traces.
  - Creativity pressure: an entropy bonus encouraging spread across diverse traces.
- Gradient flow on measures: By treating the trace distribution as a continuous object, the authors derive a gradient‑flow update that can be implemented with standard back‑propagation plus a few extra terms (entropy gradient, adaptive temperature); a minimal training‑step sketch follows this list.
- Special‑case mapping: They mathematically show that setting the creativity weight to zero recovers STaR/GRPO/DPO, while adding a constant entropy term reproduces existing entropy‑bonus tricks.
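First, trace‑level scoring. A minimal sketch assuming an autoregressive policy whose per‑step logits are available; `trace_log_prob` is an illustrative helper, not code from the paper:

```python
import torch
import torch.nn.functional as F

def trace_log_prob(logits: torch.Tensor, trace: torch.Tensor) -> torch.Tensor:
    """Log-probability of one reasoning trace under the policy.

    logits: [T, V] next-token logits the model produced at each step.
    trace:  [T]    token ids of the sampled thought steps.

    The trace probability is the product of per-step probabilities,
    so its log is the sum of per-step log-probs.
    """
    log_probs = F.log_softmax(logits, dim=-1)          # [T, V]
    step_lp = log_probs.gather(1, trace.unsqueeze(1))  # [T, 1]
    return step_lp.sum()
```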
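Next, the variational objective. One plausible instantiation under assumed notation, where r(τ) is the trace score and β the creativity weight (the paper's exact parameterization may differ): the policy is pulled toward a reward‑tilted target, and the KL decomposes exactly into the two pressures above. Note the special‑case mapping falls out directly: β → 0 collapses the target onto the highest‑reward traces (reward‑only training), while a fixed β > 0 acts as a constant entropy bonus.

```latex
\[
  \pi^{\star}(\tau) \;\propto\; \exp\!\bigl(r(\tau)/\beta\bigr),
  \qquad
  \mathcal{L}(\theta)
  \;=\; \mathrm{KL}\!\bigl(\pi_\theta \,\|\, \pi^{\star}\bigr)
  \;=\; -\tfrac{1}{\beta}\,\mathbb{E}_{\tau\sim\pi_\theta}\!\bigl[r(\tau)\bigr]
        \;-\; H(\pi_\theta) \;+\; \mathrm{const}.
\]
```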
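Finally, the update itself. A hedged sketch of one entropy‑regularized policy‑gradient step consistent with the description above, not the authors' exact algorithm; `policy.log_prob` is an assumed interface returning the log‑probability of a whole trace:

```python
import torch

def dcr_step(policy, traces, rewards, beta, optimizer):
    """One entropy-regularized policy-gradient step (illustrative).

    Maximizing E[r(tau)] + beta * H(pi) has the same policy gradient as
    REINFORCE with the shaped reward r(tau) - beta * log pi(tau), since
    the extra '+1' term in the entropy gradient has zero expectation.
    """
    log_ps = torch.stack([policy.log_prob(t) for t in traces])  # [N]
    shaped = rewards - beta * log_ps.detach()  # fold in creativity pressure
    shaped = shaped - shaped.mean()            # variance-reducing baseline
    loss = -(shaped * log_ps).mean()           # score-function surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The paper's adaptive temperature scaling could plausibly be layered on by scheduling `beta` during training, though the exact schedule is not specified in this summary.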
Results & Findings
| Setting | Accuracy (↑) | Semantic Entropy (↑) | Diversity Score* |
|---|---|---|---|
| Baseline STaR | 84.2 % | 1.31 bits | 0.42 |
| GRPO (no entropy) | 85.0 % | 1.08 bits | 0.35 |
| DPO (reward‑only) | 84.7 % | 0.97 bits | 0.31 |
| DCR (proposed) | 85.3 % | 2.04 bits | 0.58 |
*Diversity Score = normalized pairwise trace‑distance.
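A plausible reading of this metric, assuming traces are embedded into a common vector space (the paper's exact distance may differ); `diversity_score` is an illustrative helper:

```python
import torch
import torch.nn.functional as F

def diversity_score(embeddings: torch.Tensor) -> float:
    """Mean pairwise cosine distance between trace embeddings, in [0, 1].

    embeddings: [N, D], one embedding per sampled reasoning trace (N >= 2).
    Cosine distance lies in [0, 2]; dividing by 2 normalizes the score.
    """
    e = F.normalize(embeddings, dim=-1)
    sims = e @ e.T                                 # [N, N] cosine similarities
    n = e.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()  # drop self-similarity
    mean_sim = off_diag / (n * (n - 1))
    return ((1.0 - mean_sim) / 2.0).item()
```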
Key Takeaways
- Correctness is preserved – DCR matches or slightly exceeds the best baseline accuracy.
- Semantic entropy rises sharply, from 0.97–1.31 bits across the baselines to 2.04 bits, indicating a richer set of reasoning paths.
- Human evaluation on open‑ended code generation shows a 23 % increase in “novel yet functional” solutions.
Practical Implications
- Developer‑centric toolchains: Integrating DCR into existing “self‑refine” pipelines (e.g., OpenAI’s `function_call` loops, LangChain agents) can yield assistants that propose multiple viable strategies instead of converging on a single “safe” answer; a selection sketch follows this list.
- Creative coding & debugging: For code‑generation models, higher trace diversity translates into alternative algorithmic approaches, aiding developers who need to explore trade‑offs (performance vs. readability).
- Product design & ideation: LLM‑powered brainstorming bots can maintain a steady flow of unconventional suggestions without sacrificing factual correctness, improving user engagement.
- Safety & alignment: By preventing mode collapse, DCR reduces the risk of over‑optimizing toward narrow reward proxies, a known source of unintended behavior.
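As a hedged sketch of the toolchain idea above: sample several completions, then keep only the ones that are semantically far apart. `generate` and `embed` stand in for whatever LLM sampling and embedding calls a pipeline already has (hypothetical helpers, not a real API):

```python
import torch
import torch.nn.functional as F

def diverse_candidates(generate, embed, prompt, n=8, k=3, min_dist=0.3):
    """Sample n completions, then greedily keep up to k that are at least
    `min_dist` apart (cosine distance) from every candidate kept so far."""
    texts = [generate(prompt, temperature=1.0) for _ in range(n)]
    embs = F.normalize(embed(texts), dim=-1)  # [n, D] unit-norm embeddings
    kept = []
    for i, text in enumerate(texts):
        # Keep this candidate only if it is far from everything kept so far.
        if all(1.0 - float(embs[i] @ embs[j]) >= min_dist for j, _ in kept):
            kept.append((i, text))
        if len(kept) == k:
            break
    return [t for _, t in kept]
```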
Limitations & Future Work
- Computational overhead: Estimating entropy gradients adds roughly 15 % to runtime compared with vanilla STaR; scaling to multi‑billion‑parameter models may require approximation tricks.
- Task scope: Experiments focus on reasoning‑heavy benchmarks; the benefits for short‑answer QA or pure classification tasks remain unclear.
- Hyper‑parameter sensitivity: The trade‑off weight between correctness and creativity needs careful tuning per domain; automated scheduling is an open problem.
- Future directions: The authors suggest (i) hierarchical trace representations to further reduce cost, (ii) curriculum‑style annealing of the creativity term, and (iii) extending DCR to multimodal reasoning (e.g., vision‑language agents).
Authors
- Max Ruiz Luyten
- Mihaela van der Schaar
Paper Information
- arXiv ID: 2601.00747v1
- Categories: cs.LG
- Published: January 2, 2026