[Paper] The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving

Published: January 2, 2026 at 12:10 PM EST
3 min read

Source: arXiv - 2601.00747v1

Overview

The paper The Reasoning‑Creativity Trade‑off: Toward Creativity‑Driven Problem Solving examines why modern large‑language‑model (LLM) pipelines that repeatedly “sample‑think‑refine” tend to lose creative spark as they chase correctness. By framing reasoning as a probability distribution over solution traces, the authors propose a unified variational objective—Distributional Creative Reasoning (DCR)—that simultaneously preserves answer quality and semantic diversity.
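
In schematic form (our own notation, not necessarily the paper's exact equations), a creativity‑aware variational objective of this kind trades expected trace reward against trace entropy:

```latex
% Schematic creativity-aware objective (our notation, not the paper's exact form):
% p_theta = the policy's distribution over reasoning traces tau,
% r(tau)  = correctness reward for a trace, lambda >= 0 = creativity weight.
\max_{\theta}\;
  \underbrace{\mathbb{E}_{\tau \sim p_{\theta}}\bigl[r(\tau)\bigr]}_{\text{correctness pressure}}
  \;+\;
  \lambda\,
  \underbrace{\mathcal{H}\bigl(p_{\theta}\bigr)}_{\text{creativity pressure}},
\qquad
\mathcal{H}(p_{\theta}) = -\,\mathbb{E}_{\tau \sim p_{\theta}}\bigl[\log p_{\theta}(\tau)\bigr].
```

Maximizing this expression is equivalent to minimizing a KL divergence between p_θ and a reward‑tilted target distribution p*(τ) ∝ exp(r(τ)/λ), which is the KL‑type framing described under Methodology below.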

Key Contributions

  • Unified theoretical framework (DCR): Shows that popular methods (STaR, GRPO, DPO, entropy bonuses, etc.) are special cases of a single variational loss over reasoning‑path distributions.
  • Diversity Decay Theorem: Formal proof that correctness‑centric objectives inevitably contract the entropy of reasoning paths, with distinct decay patterns for STaR, GRPO, and DPO.
  • Stability‑Diversity design recipe: Practical algorithmic tweaks (e.g., entropy‑regularized gradient flow, adaptive temperature scaling) that guarantee convergence to a policy that is both accurate and diverse.
  • Empirical validation: Benchmarks on creative reasoning tasks (puzzle solving, open‑ended code generation, story continuation) demonstrate that DCR‑enhanced models retain higher semantic entropy while matching or exceeding baseline accuracy.

Methodology

  1. Trace‑level modeling: Each reasoning episode is represented as a trace—the ordered sequence of intermediate tokens or “thought steps.” The model’s policy defines a probability measure over all possible traces.
  2. Variational objective: DCR minimizes a KL‑type divergence between the model’s trace distribution and a target distribution that balances two forces:
    • Correctness pressure (rewarding high‑scoring traces).
    • Creativity pressure (entropy bonus encouraging spread across diverse traces).
  3. Gradient flow on measures: By treating the trace distribution as a continuous object, the authors derive a gradient‑flow update that can be implemented with standard back‑propagation plus a few extra terms (entropy gradient, adaptive temperature); see the sketch after this list.
  4. Special‑case mapping: They mathematically show that setting the creativity weight to zero recovers STaR/GRPO/DPO, while adding a constant entropy term reproduces existing entropy‑bonus tricks.
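
To make steps 2–4 concrete, here is a minimal PyTorch‑style sketch of an entropy‑regularized trace loss. The function names, the Monte‑Carlo entropy estimate, and the adaptive‑temperature rule are illustrative choices consistent with the summary above, not the paper's reference implementation.

```python
# Minimal sketch of a DCR-style objective (illustrative, not the paper's code):
#   correctness pressure = reward-weighted log-likelihood of sampled traces,
#   creativity pressure  = Monte-Carlo estimate of trace entropy.
# Setting creativity_weight = 0 recovers a plain reward-weighted
# (STaR/GRPO-like) update, mirroring the special-case mapping in step 4.
import torch


def dcr_loss(trace_logprobs: torch.Tensor,
             trace_rewards: torch.Tensor,
             creativity_weight: float = 0.1) -> torch.Tensor:
    """trace_logprobs: (N,) summed log p_theta(trace) for N sampled traces.
    trace_rewards:  (N,) scalar correctness scores for the same traces."""
    # Correctness pressure: push probability mass toward high-reward traces.
    # Rewards are treated as fixed weights (no gradient flows through them).
    correctness = -(trace_rewards.detach() * trace_logprobs).mean()

    # Creativity pressure: H(p) ~ -E[log p(trace)], estimated from the samples.
    entropy_estimate = -trace_logprobs.mean()

    # Minimizing this loss maximizes reward-weighted likelihood plus entropy.
    return correctness - creativity_weight * entropy_estimate


def adapt_temperature(temperature: float, entropy_estimate: float,
                      target_entropy: float, lr: float = 0.05) -> float:
    """One simple adaptive-temperature schedule of the kind mentioned in the
    stability-diversity recipe: raise the sampling temperature when measured
    entropy falls below a target, lower it otherwise."""
    return max(0.1, temperature + lr * (target_entropy - entropy_estimate))
```

In a training loop, trace_logprobs would come from teacher‑forced scoring of the sampled reasoning traces and the loss is back‑propagated as usual; with creativity_weight = 0 the entropy term vanishes and only the correctness pressure remains.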

Results & Findings

Setting           | Accuracy (↑) | Semantic Entropy (↑) | Diversity Score*
------------------|--------------|----------------------|-----------------
Baseline STaR     | 84.2 %       | 1.31 bits            | 0.42
GRPO (no entropy) | 85.0 %       | 1.08 bits            | 0.35
DPO (reward‑only) | 84.7 %       | 0.97 bits            | 0.31
DCR (proposed)    | 85.3 %       | 2.04 bits            | 0.58

*Diversity Score = normalized pairwise trace‑distance.
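
The footnote does not spell out how the score is computed; a plausible reading, shown below as an assumption rather than the paper's exact metric, is the mean pairwise distance between embeddings of the sampled traces, normalized to lie roughly in [0, 1].

```python
# Illustrative computation of a "normalized pairwise trace-distance" diversity
# score (our reading of the footnote, not the paper's exact metric).
import numpy as np


def diversity_score(trace_embeddings: np.ndarray) -> float:
    """trace_embeddings: (N, D) array, one embedding per reasoning trace.
    Returns the mean pairwise cosine distance over distinct pairs; this lies
    in [0, 1] when similarities are non-negative (and at most 2 in general)."""
    n = trace_embeddings.shape[0]
    if n < 2:
        return 0.0
    # Normalize rows so the dot product is cosine similarity.
    emb = trace_embeddings / np.linalg.norm(trace_embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    # Average over the N*(N-1)/2 distinct pairs (exclude the diagonal).
    iu = np.triu_indices(n, k=1)
    return float(np.mean(1.0 - sims[iu]))
```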

Key Takeaways

  • Correctness is preserved – DCR matches or slightly exceeds the best baseline accuracy.
  • Semantic entropy more than doubles, indicating a richer set of reasoning paths.
  • Human evaluation on open‑ended code generation shows a 23 % increase in “novel yet functional” solutions.

Practical Implications

  • Developer‑centric toolchains: Integrating DCR into existing “self‑refine” pipelines (e.g., OpenAI’s function_call loops, LangChain agents) can yield assistants that propose multiple viable strategies instead of converging on a single “safe” answer (a minimal sketch follows this list).
  • Creative coding & debugging: For code‑generation models, higher trace diversity translates into alternative algorithmic approaches, aiding developers who need to explore trade‑offs (performance vs. readability).
  • Product design & ideation: LLM‑powered brainstorming bots can maintain a steady flow of unconventional suggestions without sacrificing factual correctness, improving user engagement.
  • Safety & alignment: By preventing mode collapse, DCR reduces the risk of over‑optimizing toward narrow reward proxies, a known source of unintended behavior.
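
As a sketch of the first bullet, the snippet below shows a generic self‑refine loop that surfaces several mutually dissimilar candidates instead of collapsing to one answer. The generate, score, and embed callables are placeholders for whatever model API a given pipeline uses; none of this is tied to the paper's implementation.

```python
# Generic diversity-aware candidate selection for a self-refine loop
# (illustrative sketch; `generate`, `score`, and `embed` stand in for the
# actual model calls in a given pipeline).
from typing import Callable, List
import numpy as np


def diverse_candidates(prompt: str,
                       generate: Callable[[str], str],
                       score: Callable[[str], float],
                       embed: Callable[[str], np.ndarray],
                       n_samples: int = 8,
                       k_keep: int = 3,
                       min_distance: float = 0.3) -> List[str]:
    """Sample n_samples candidates, then greedily keep up to k_keep that are
    both high-scoring and at least min_distance (cosine) apart."""
    candidates = sorted((generate(prompt) for _ in range(n_samples)),
                        key=score, reverse=True)
    kept: List[str] = []
    kept_embs: List[np.ndarray] = []
    for cand in candidates:
        e = embed(cand)
        e = e / np.linalg.norm(e)
        # Keep a candidate only if it is sufficiently far from everything kept.
        if all(1.0 - float(e @ other) >= min_distance for other in kept_embs):
            kept.append(cand)
            kept_embs.append(e)
        if len(kept) == k_keep:
            break
    return kept
```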

Limitations & Future Work

  • Computational overhead: Estimating entropy gradients adds ~15 % runtime compared with vanilla STaR; scaling to multi‑billion‑parameter models may require approximation tricks.
  • Task scope: Experiments focus on reasoning‑heavy benchmarks; the benefits for short‑answer QA or pure classification tasks remain unclear.
  • Hyper‑parameter sensitivity: The trade‑off weight between correctness and creativity needs careful tuning per domain; automated scheduling is an open problem.
  • Future directions: The authors suggest (i) hierarchical trace representations to further reduce cost, (ii) curriculum‑style annealing of the creativity term, and (iii) extending DCR to multimodal reasoning (e.g., vision‑language agents).

Authors

  • Max Ruiz Luyten
  • Mihaela van der Schaar

Paper Information

  • arXiv ID: 2601.00747v1
  • Categories: cs.LG
  • Published: January 2, 2026