[Paper] Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

Published: December 12, 2025 at 07:14 AM EST
4 min read
Source: arXiv - 2512.11509v1

Overview

The paper “Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs” asks a surprisingly practical question: when we tame the tendency of large language models (LLMs) to “hallucinate” (i.e., generate false facts), do we also dampen their ability to produce novel, creative ideas? By testing three popular hallucination‑reduction techniques across several model families, the authors reveal that the answer depends on the method you pick—information that matters for anyone building AI‑assisted research tools, brainstorming assistants, or creative coding helpers.

Key Contributions

  • Systematic comparison of three hallucination‑reduction strategies—Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval‑Augmented Generation (RAG)—on creativity.
  • Broad experimental coverage: three LLM families (LLaMA, Qwen, Mistral) spanning 1 B to 70 B parameters.
  • Dual benchmark evaluation using NeoCoder (code‑generation creativity) and CS4 (open‑ended creative writing).
  • Empirical finding that hallucination‑reduction methods have opposing effects on divergent creativity: CoVe boosts it, DoLa suppresses it, while RAG is largely neutral.
  • Practical guidance for developers building AI‑assisted scientific discovery pipelines where factual correctness and creative hypothesis generation must coexist.

Methodology

  1. Hallucination‑reduction techniques

    • Chain of Verification (CoVe): the model first generates an answer, then runs a verification chain (self‑questioning, fact‑checking) before outputting the final result. A minimal sketch of this loop follows this list.
    • Decoding by Contrasting Layers (DoLa): modifies the decoding process by contrasting hidden‑state representations from early vs. later transformer layers, encouraging “conservative” token choices.
    • Retrieval‑Augmented Generation (RAG): augments the prompt with top‑k relevant documents fetched from an external knowledge base, grounding the generation.
  2. Model families & scales

    • LLaMA, Qwen, Mistral – each evaluated at 1 B, 7 B, 13 B, 30 B, and 70 B (where available).
  3. Creativity benchmarks

    • NeoCoder: prompts that require generating novel code snippets (e.g., “Write a function that solves a new variant of the traveling‑salesperson problem”).
    • CS4: open‑ended story/idea prompts designed to measure divergent thinking (multiple plausible continuations, originality scores).
  4. Metrics

    • Hallucination rate: measured via automatic fact‑checking against a gold knowledge base and human verification.
    • Creativity: scored using standard divergent‑thinking metrics—fluency, originality, flexibility—computed from n‑gram diversity, semantic novelty, and human raters. A toy n‑gram‑diversity computation is sketched after this list.
  5. Experimental protocol

    • For each model‑technique pair, generate 500 responses per benchmark.
    • Compute hallucination reduction relative to a baseline (plain decoding).
    • Compare creativity scores across techniques while controlling for model size.
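
To make the verification chain concrete, here is a minimal sketch of a CoVe‑style loop. It assumes only a generic `generate(prompt) -> str` callable wrapping whatever LLM API is in use; the prompts and the four‑step structure are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of a Chain-of-Verification (CoVe) style loop.
# Assumes only a `generate(prompt) -> str` callable that wraps an LLM API;
# prompts and structure are illustrative, not the paper's implementation.

from typing import Callable

def chain_of_verification(question: str,
                          generate: Callable[[str], str],
                          num_checks: int = 3) -> str:
    # 1. Draft an initial answer.
    draft = generate(f"Answer the question:\n{question}")

    # 2. Ask the model to pose verification questions about its own draft.
    raw_checks = generate(
        f"List {num_checks} short questions that would verify the factual "
        f"claims in this answer:\n{draft}"
    )
    checks = [q.strip() for q in raw_checks.splitlines() if q.strip()][:num_checks]

    # 3. Answer each check independently of the draft, so errors in the
    #    draft do not bias the verification step.
    findings = [generate(f"Answer concisely: {q}") for q in checks]

    # 4. Revise the draft so it is consistent with the verification answers.
    return generate(
        "Revise the draft below so it agrees with the verification answers.\n"
        f"Draft:\n{draft}\n\nVerification answers:\n" + "\n".join(findings)
    )
```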
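
The creativity scoring can also be made concrete with a toy example. The `distinct_n` function below computes n‑gram diversity over a batch of generations; it is a common lexical‑diversity proxy and only one ingredient of the paper's creativity score (semantic novelty and human ratings are not reproduced here).

```python
# Toy distinct-n (n-gram diversity) metric over a batch of model outputs.
# Lexical-diversity proxy only; the paper also uses semantic novelty and
# human ratings, which are not reproduced here.

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(responses: list[str], n: int = 2) -> float:
    """Share of unique n-grams across all responses (1.0 = maximally diverse)."""
    all_ngrams: list[tuple[str, ...]] = []
    for text in responses:
        all_ngrams.extend(ngrams(text.lower().split(), n))
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

# Example: a repetitive batch vs. a more divergent one.
repetitive = ["the cat sat on the mat", "the cat sat on the rug"]
divergent = ["a comet naps on velvet moss", "the cat charters a tiny submarine"]
print(distinct_n(repetitive), distinct_n(divergent))  # lower vs. higher score
```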

Results & Findings

| Technique | Hallucination ↓ | NeoCoder Creativity ↑ | CS4 Creativity ↑ |
|-----------|-----------------|-----------------------|------------------|
| CoVe | -28 % (vs. baseline) | +12 % (significant) | +9 % |
| DoLa | -22 % | -15 % (significant drop) | -13 % |
| RAG | -25 % | ±1 % (no statistically significant change) | ±2 % |

  • CoVe not only cuts hallucinations but stimulates divergent thinking. The verification chain appears to act like a “self‑reflection” step, prompting the model to explore alternative phrasings before committing.
  • DoLa reduces hallucinations at the cost of creativity; the layer‑contrast decoding pushes the model toward safer, more “canonical” token choices, limiting novelty.
  • RAG grounds the model without noticeably affecting its creative breadth—useful when you need factual grounding but still want the model’s imagination to roam.
  • The effects are consistent across model families and scale, though larger models (≥30 B) show a slightly muted drop in creativity for DoLa, suggesting size can partially compensate.

Practical Implications

  • AI‑assisted research tools (e.g., hypothesis generators, literature‑review assistants) can adopt CoVe when the workflow benefits from both factual checks and creative leaps—think “generate a plausible but novel mechanism, then verify against known chemistry.”
  • Code‑generation platforms that need reliable yet inventive snippets (e.g., auto‑completion for novel algorithms) may also favor CoVe, as it improves correctness while encouraging out‑of‑the‑box solutions.
  • Safety‑critical applications (medical advice, legal drafting) where hallucinations are intolerable should consider DoLa or RAG. DoLa is a good fit when you can tolerate a modest loss in creative richness; RAG is ideal when you want grounding without sacrificing the model’s imaginative capacity.
  • Product design: developers can expose a “creativity‑vs‑accuracy” toggle that internally swaps between CoVe, DoLa, and RAG, giving end‑users control over the trade‑off. A routing sketch follows this list.
  • Prompt engineering: the findings suggest that adding a verification step (even a lightweight one) can be a cheap way to boost both accuracy and novelty, without needing extra retrieval infrastructure.
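
One possible shape for such a toggle, assuming hypothetical `run_cove`, `run_rag`, and `run_dola` callables supplied by the product, is sketched below; the thresholds are arbitrary and the mapping simply mirrors the trade‑offs reported above.

```python
# Hypothetical "creativity-vs-accuracy" dial routing a request to one of the
# three strategies. run_cove / run_rag / run_dola are placeholders for
# whatever implementations a product wires in; thresholds are arbitrary.

from typing import Callable

Strategy = Callable[[str], str]

def pick_strategy(creativity: float,
                  run_cove: Strategy,
                  run_rag: Strategy,
                  run_dola: Strategy) -> Strategy:
    """Map a 0..1 creativity preference to a decoding strategy."""
    if creativity >= 0.66:
        return run_cove   # verified yet divergent outputs
    if creativity >= 0.33:
        return run_rag    # grounded, creativity roughly unchanged
    return run_dola       # most conservative decoding
```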

Limitations & Future Work

  • Domain coverage: Benchmarks focus on code and open‑ended writing; scientific domains (biology, physics) may behave differently.
  • Verification quality: CoVe’s verification chain relies on the model’s own self‑assessment, which can still be biased; external fact‑checkers were not explored.
  • Scalability: CoVe adds extra inference passes, increasing latency—future work could streamline verification for real‑time applications.
  • User studies: The paper measures creativity via automated metrics and expert raters; real‑world user satisfaction and downstream impact (e.g., successful hypothesis generation) remain to be validated.
  • Hybrid approaches: Combining RAG’s grounding with CoVe’s self‑verification could yield even better balances; the authors propose exploring such pipelines next.

Authors

  • Mohor Banerjee
  • Nadya Yuki Wangsajaya
  • Syed Ali Redha Alsagoff
  • Min Sen Tan
  • Zachary Choy Kit Chun
  • Alvin Chan Guo Wei

Paper Information

  • arXiv ID: 2512.11509v1
  • Categories: cs.CL, cs.AI
  • Published: December 12, 2025
