[Paper] CREATE: Testing LLMs for Associative Creativity
Source: arXiv - 2603.09970v1
Overview
The paper “CREATE: Testing LLMs for Associative Creativity” introduces a new benchmark that measures how well large language models (LLMs) can perform associative reasoning—the ability to link seemingly unrelated concepts in novel, meaningful ways. By turning a notoriously subjective skill into a concrete, automatically‑gradable task, the authors give researchers and engineers a practical playground for building more “creative” AI systems.
Key Contributions
- CREATE benchmark: A large‑scale, objectively scored dataset that asks models to generate multiple “paths” linking two concepts via intermediate ideas stored in the model’s parametric knowledge.
- Dual‑objective scoring: Paths are evaluated on specificity (how distinct and tightly connected the intermediate concepts are) and diversity (how different each path is from the others).
- Comprehensive evaluation of state‑of‑the‑art LLMs, including chain‑of‑thought (CoT) prompting, larger token budgets, and recent “creative prompting” techniques.
- Analysis of search complexity: Demonstrates that the combinatorial explosion of possible paths makes benchmark saturation hard, highlighting the need for smarter generation strategies.
- Open sandbox: The authors release the dataset and evaluation code, encouraging the community to experiment with new prompting, decoding, or fine‑tuning methods aimed at boosting associative creativity.
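The search-complexity point is easy to see on a toy concept graph: even a handful of nodes yields many candidate paths between two endpoints, and the count grows combinatorially with graph size and path length. A minimal stdlib sketch (the graph and concept names below are invented for illustration, not taken from the benchmark):

```python
# Toy illustration of path-count explosion between two concepts.
# The concept graph below is invented for illustration only.
GRAPH = {
    "rain":       ["puddle", "cloud", "umbrella"],
    "puddle":     ["drum", "reflection"],
    "cloud":      ["thunder", "sky"],
    "umbrella":   ["rhythm"],
    "drum":       ["rhythm"],
    "reflection": ["mood"],
    "thunder":    ["rhythm", "sound"],
    "sky":        ["sound"],
    "rhythm":     ["music"],
    "mood":       ["music"],
    "sound":      ["music"],
    "music":      [],
}

def simple_paths(graph, start, goal, max_len):
    """Enumerate all simple paths from start to goal with at most max_len nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) >= max_len:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple (no revisits)
                stack.append((nxt, path + [nxt]))

paths = list(simple_paths(GRAPH, "rain", "music", max_len=6))
print(len(paths))  # -> 6 distinct rain -> music paths even in this tiny graph
```

Real-world concept spaces have far higher branching factors, which is why exhaustive enumeration is infeasible and the benchmark resists saturation.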
Methodology
- Task formulation – Given a pair of concepts (e.g., “rain” and “music”), the model must output a set of paths: sequences of intermediate concepts that connect the two ends (e.g., rain → puddle → drum → rhythm → music).
- Answer generation – Models are prompted to produce many candidate paths, with a generous token budget to allow exhaustive exploration. Different prompting styles are tested: vanilla generation, chain‑of‑thought, and “creative” prompts that explicitly ask for novelty.
- Scoring –
  - Specificity: Measured by how closely each intermediate term relates to its neighbors (using lexical similarity and knowledge‑graph distance).
  - Diversity: Computed as pairwise dissimilarity across all generated paths (e.g., Jaccard distance on the sets of intermediate nodes).
  - The final score rewards large collections of high‑specificity, high‑diversity paths.
- Benchmark construction – The authors curated thousands of concept pairs from commonsense resources and verified that the automatic scores align with human judgments of path quality, enabling reliable automatic grading.
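The diversity component described above can be sketched with stdlib proxies. The function below mirrors the idea of pairwise Jaccard distance over intermediate nodes; it is an illustrative stand-in, not the paper's actual scoring code, and the example paths are invented:

```python
from itertools import combinations

def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; a distance of 1.0 means no shared nodes."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def diversity_score(paths: list[list[str]]) -> float:
    """Mean pairwise Jaccard distance over each path's intermediate nodes."""
    # Drop the two endpoint concepts; only intermediates count toward diversity.
    inners = [set(p[1:-1]) for p in paths]
    pairs = list(combinations(inners, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

paths = [
    ["rain", "puddle", "drum", "rhythm", "music"],
    ["rain", "cloud", "thunder", "sound", "music"],
    ["rain", "puddle", "reflection", "mood", "music"],
]
print(round(diversity_score(paths), 3))  # -> 0.933: mostly non-overlapping paths
```

A specificity proxy would additionally score each adjacent pair within a path (lexical similarity or knowledge‑graph distance), which requires an external resource and is omitted here.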
Results & Findings
| Model / Setting | Avg. CREATE Score | Observations |
|---|---|---|
| GPT‑4 (baseline) | 0.62 | Strong specificity but limited diversity; many paths overlap. |
| GPT‑4 + CoT | 0.64 | Slight boost in specificity; diversity unchanged. |
| GPT‑4 + Creative Prompt | 0.68 | Notable improvement in both dimensions, but still far from human performance. |
| Larger token budget (4×) | 0.66 | More paths generated, but diminishing returns on diversity. |
| Human annotators | 0.85 | Humans naturally produce a richer variety of connections. |
Key takeaways:
- Creative prompting helps, but the gap to human‑level associative reasoning remains sizable.
- Simply giving the model more tokens does not guarantee better diversity; smarter search (e.g., sampling strategies) is needed.
- The benchmark’s multiplicity of valid answers makes it resistant to “gaming”—models can’t just memorize a single correct path.
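One simple "smarter search" baseline suggested by these takeaways is rejecting near-duplicate paths at generation time: keep a sampled path only if its intermediate nodes overlap little with everything kept so far. A hedged sketch; the greedy strategy and the 0.5 overlap threshold are my assumptions, not the paper's method:

```python
def keep_diverse(candidates: list[list[str]], max_overlap: float = 0.5) -> list[list[str]]:
    """Greedily keep paths whose intermediate-node Jaccard overlap with
    every already-kept path stays below max_overlap."""
    kept = []
    for path in candidates:
        inner = set(path[1:-1])

        def overlap(other: list[str]) -> float:
            other_inner = set(other[1:-1])
            union = inner | other_inner
            return len(inner & other_inner) / len(union) if union else 0.0

        if all(overlap(p) < max_overlap for p in kept):
            kept.append(path)
    return kept

candidates = [
    ["rain", "puddle", "drum", "rhythm", "music"],
    ["rain", "puddle", "drum", "beat", "music"],    # near-duplicate of the first
    ["rain", "cloud", "thunder", "sound", "music"],
]
print(len(keep_diverse(candidates)))  # -> 2: the near-duplicate is filtered out
```

Filtering like this trades raw path count for diversity, which is exactly the axis where extra tokens alone showed diminishing returns.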
Practical Implications
- Idea generation tools: Developers building brainstorming assistants, product‑naming services, or hypothesis‑generation platforms can use CREATE as a validation suite to ensure their models produce genuinely novel suggestions rather than rehashing common tropes.
- Knowledge‑graph enrichment: Automated path creation can surface hidden relations for graph construction, improving downstream tasks like recommendation or semantic search.
- Prompt engineering: The findings highlight the importance of task‑specific prompts that explicitly ask for “unusual but plausible” connections, a useful pattern for any creative‑AI application.
- Evaluation pipeline: CREATE’s objective scoring can be integrated into CI pipelines for LLM fine‑tuning, giving engineers a quantitative signal for “creativity” alongside traditional accuracy metrics.
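A CI gate on such a score could look like the following sketch. The threshold, the setting names, and the `check_creativity` helper are hypothetical, not part of the released evaluation code:

```python
import sys

MIN_SCORE = 0.60  # hypothetical regression threshold for the creativity signal

def check_creativity(scores: dict[str, float]) -> bool:
    """Return False (fail the gate) if any tracked setting drops below threshold."""
    failures = {name: s for name, s in scores.items() if s < MIN_SCORE}
    for name, score in failures.items():
        print(f"FAIL {name}: {score:.2f} < {MIN_SCORE:.2f}", file=sys.stderr)
    return not failures

# Example scores as a CI step might collect them after an eval run.
ok = check_creativity({"baseline": 0.62, "creative_prompt": 0.68})
print("pass" if ok else "fail")
```

In practice the scores would come from running the released evaluation code on a fixed set of concept pairs, alongside the usual accuracy metrics.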
Limitations & Future Work
- Domain coverage: The benchmark focuses on everyday concepts; specialized domains (e.g., biomedical, legal) may require tailored path vocabularies.
- Scoring heuristics: Specificity and diversity rely on lexical and graph‑based proxies, which may not capture deeper semantic novelty.
- Model size bias: Larger models tend to generate more paths, but the metric does not fully normalize for capacity differences.
- Future directions suggested by the authors include: developing search‑guided decoding (e.g., Monte‑Carlo tree search), incorporating external knowledge bases to broaden the concept space, and extending the benchmark to multimodal creativity (linking images, code, or music).
Authors
- Manya Wadhwa
- Tiasa Singha Roy
- Harvey Lederman
- Junyi Jessy Li
- Greg Durrett
Paper Information
- arXiv ID: 2603.09970v1
- Categories: cs.CL
- Published: March 10, 2026