[Paper] CREATE: Testing LLMs for Associative Creativity
Source: arXiv - 2603.09970v1
Overview
The paper “CREATE: Testing LLMs for Associative Creativity” introduces a new benchmark that measures how well large language models (LLMs) can perform associative reasoning—the ability to link seemingly unrelated concepts in novel, meaningful ways. By turning a notoriously subjective skill into a concrete, automatically‑gradable task, the authors give researchers and engineers a practical playground for building more “creative” AI systems.
Key Contributions
- CREATE benchmark: A large‑scale, objectively scored dataset that asks models to generate multiple “paths” linking two concepts via intermediate ideas stored in the model’s parametric knowledge.
- Dual‑objective scoring: Paths are evaluated on specificity (how distinct and tightly connected the intermediate concepts are) and diversity (how different each path is from the others).
- Comprehensive evaluation of state‑of‑the‑art LLMs, including chain‑of‑thought (CoT) prompting, larger token budgets, and recent “creative prompting” techniques.
- Analysis of search complexity: Demonstrates that the combinatorial explosion of possible paths makes benchmark saturation hard, highlighting the need for smarter generation strategies.
- Open sandbox: The authors release the dataset and evaluation code, encouraging the community to experiment with new prompting, decoding, or fine‑tuning methods aimed at boosting associative creativity.
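The search-complexity point is easy to see on a toy concept graph: even a handful of nodes yields many candidate paths between two endpoints, and the count grows combinatorially with graph size and path length. A minimal stdlib sketch (the graph and concept names below are invented for illustration, not taken from the benchmark):

```python
# Toy illustration of path-count explosion between two concepts.
# The concept graph below is invented for illustration only.
GRAPH = {
    "rain":       ["puddle", "cloud", "umbrella"],
    "puddle":     ["drum", "reflection"],
    "cloud":      ["thunder", "sky"],
    "umbrella":   ["rhythm"],
    "drum":       ["rhythm"],
    "reflection": ["mood"],
    "thunder":    ["rhythm", "sound"],
    "sky":        ["sound"],
    "rhythm":     ["music"],
    "mood":       ["music"],
    "sound":      ["music"],
    "music":      [],
}

def simple_paths(graph, start, goal, max_len):
    """Enumerate all simple paths from start to goal with at most max_len nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) >= max_len:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple (no revisits)
                stack.append((nxt, path + [nxt]))

paths = list(simple_paths(GRAPH, "rain", "music", max_len=6))
print(len(paths))  # -> 6 distinct rain -> music paths even in this tiny graph
```

Real-world concept spaces have far higher branching factors, which is why exhaustive enumeration is infeasible and the benchmark resists saturation.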
Methodology
- Task formulation – Given a pair of concepts (e.g., “rain” and “music”), the model must output a set of paths: sequences of intermediate concepts that connect the two ends (e.g., rain → puddle → drum → rhythm → music).
- Answer generation – Models are prompted to produce many candidate paths, with a generous token budget to allow exhaustive exploration. Different prompting styles are tested: vanilla generation, chain‑of‑thought, and “creative” prompts that explicitly ask for novelty.
- Scoring –
  - Specificity: Measured by how closely each intermediate term relates to its neighbors (using lexical similarity and knowledge‑graph distance).
  - Diversity: Computed as pairwise dissimilarity across all generated paths (e.g., Jaccard distance on the sets of intermediate nodes).
  - The final score rewards large collections of high‑specificity, high‑diversity paths.
- Benchmark construction – The authors curated thousands of concept pairs from commonsense resources and verified that the automatic scores align with human judgments of path quality, enabling reliable automatic grading.
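The diversity component described above can be sketched with stdlib proxies. The function below mirrors the idea of pairwise Jaccard distance over intermediate nodes; it is an illustrative stand-in, not the paper's actual scoring code, and the example paths are invented:

```python
from itertools import combinations

def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; a distance of 1.0 means no shared nodes."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def diversity_score(paths: list[list[str]]) -> float:
    """Mean pairwise Jaccard distance over each path's intermediate nodes."""
    # Drop the two endpoint concepts; only intermediates count toward diversity.
    inners = [set(p[1:-1]) for p in paths]
    pairs = list(combinations(inners, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

paths = [
    ["rain", "puddle", "drum", "rhythm", "music"],
    ["rain", "cloud", "thunder", "sound", "music"],
    ["rain", "puddle", "reflection", "mood", "music"],
]
print(round(diversity_score(paths), 3))  # -> 0.933: mostly non-overlapping paths
```

A specificity proxy would additionally score each adjacent pair within a path (lexical similarity or knowledge‑graph distance), which requires an external resource and is omitted here.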
Results & Findings
| Model / Setting | Avg. CREATE Score | Observations |
|---|---|---|
| GPT‑4 (baseline) | 0.62 | Strong specificity but limited diversity; many paths overlap. |
| GPT‑4 + CoT | 0.64 | Slight boost in specificity; diversity unchanged. |
| GPT‑4 + Creative Prompt | 0.68 | Notable improvement in both dimensions, but still far from human performance. |
| Larger token budget (4×) | 0.66 | More paths generated, but diminishing returns on diversity. |
| Human annotators | 0.85 | Humans naturally produce a richer variety of connections. |
Key takeaways:
- Creative prompting helps, but the gap to human‑level associative reasoning remains sizable.
- Simply giving the model more tokens does not guarantee better diversity; smarter search (e.g., sampling strategies) is needed.
- The benchmark’s multiplicity of valid answers makes it resistant to “gaming”—models can’t just memorize a single correct path.
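One simple "smarter search" baseline suggested by these takeaways is rejecting near-duplicate paths at generation time: keep a sampled path only if its intermediate nodes overlap little with everything kept so far. A hedged sketch; the greedy strategy and the 0.5 overlap threshold are my assumptions, not the paper's method:

```python
def keep_diverse(candidates: list[list[str]], max_overlap: float = 0.5) -> list[list[str]]:
    """Greedily keep paths whose intermediate-node Jaccard overlap with
    every already-kept path stays below max_overlap."""
    kept = []
    for path in candidates:
        inner = set(path[1:-1])

        def overlap(other: list[str]) -> float:
            other_inner = set(other[1:-1])
            union = inner | other_inner
            return len(inner & other_inner) / len(union) if union else 0.0

        if all(overlap(p) < max_overlap for p in kept):
            kept.append(path)
    return kept

candidates = [
    ["rain", "puddle", "drum", "rhythm", "music"],
    ["rain", "puddle", "drum", "beat", "music"],    # near-duplicate of the first
    ["rain", "cloud", "thunder", "sound", "music"],
]
print(len(keep_diverse(candidates)))  # -> 2: the near-duplicate is filtered out
```

Filtering like this trades raw path count for diversity, which is exactly the axis where extra tokens alone showed diminishing returns.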
Practical Implications
- Idea generation tools: Developers building brainstorming assistants, product‑naming services, or hypothesis‑generation platforms can use CREATE as a validation suite to ensure their models produce genuinely novel suggestions rather than rehashing common tropes.
- Knowledge‑graph enrichment: Automated path creation can surface hidden relations for graph construction, improving downstream tasks like recommendation or semantic search.
- Prompt engineering: The findings highlight the importance of task‑specific prompts that explicitly ask for “unusual but plausible” connections, a useful pattern for any creative‑AI application.
- Evaluation pipeline: CREATE’s objective scoring can be integrated into CI pipelines for LLM fine‑tuning, giving engineers a quantitative signal for “creativity” alongside traditional accuracy metrics.
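A CI gate on such a score could look like the following sketch. The threshold, the setting names, and the `check_creativity` helper are hypothetical, not part of the released evaluation code:

```python
import sys

MIN_SCORE = 0.60  # hypothetical regression threshold for the creativity signal

def check_creativity(scores: dict[str, float]) -> bool:
    """Return False (fail the gate) if any tracked setting drops below threshold."""
    failures = {name: s for name, s in scores.items() if s < MIN_SCORE}
    for name, score in failures.items():
        print(f"FAIL {name}: {score:.2f} < {MIN_SCORE:.2f}", file=sys.stderr)
    return not failures

# Example scores as a CI step might collect them after an eval run.
ok = check_creativity({"baseline": 0.62, "creative_prompt": 0.68})
print("pass" if ok else "fail")
```

In practice the scores would come from running the released evaluation code on a fixed set of concept pairs, alongside the usual accuracy metrics.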
Limitations & Future Work
- Domain coverage: The benchmark focuses on everyday concepts; specialized domains (e.g., biomedical, legal) may require tailored path vocabularies.
- Scoring heuristics: Specificity and diversity rely on lexical and graph‑based proxies, which may not capture deeper semantic novelty.
- Model size bias: Larger models tend to generate more paths, but the metric does not fully normalize for capacity differences.
- Future directions suggested by the authors include: developing search‑guided decoding (e.g., Monte‑Carlo tree search), incorporating external knowledge bases to broaden the concept space, and extending the benchmark to multimodal creativity (linking images, code, or music).
Authors
- Manya Wadhwa
- Tiasa Singha Roy
- Harvey Lederman
- Junyi Jessy Li
- Greg Durrett
Paper Information
- arXiv ID: 2603.09970v1
- Categories: cs.CL
- Published: March 10, 2026