[Paper] CREATE: Testing LLMs for Associative Creativity

Published: March 10, 2026 at 01:58 PM EDT
4 min read
Source: arXiv - 2603.09970v1

Overview

The paper “CREATE: Testing LLMs for Associative Creativity” introduces a new benchmark that measures how well large language models (LLMs) can perform associative reasoning—the ability to link seemingly unrelated concepts in novel, meaningful ways. By turning a notoriously subjective skill into a concrete, automatically‑gradable task, the authors give researchers and engineers a practical playground for building more “creative” AI systems.

Key Contributions

  • CREATE benchmark: A large‑scale, objectively scored dataset that asks models to generate multiple “paths” linking two concepts via intermediate ideas stored in the model’s parametric knowledge (a minimal data sketch follows this list).
  • Dual‑objective scoring: Paths are evaluated on specificity (how distinct and tightly connected the intermediate concepts are) and diversity (how different each path is from the others).
  • Comprehensive evaluation of state‑of‑the‑art LLMs under settings such as chain‑of‑thought (CoT) prompting, larger token budgets, and recent “creative prompting” techniques.
  • Analysis of search complexity: Demonstrates that the combinatorial explosion of possible paths makes benchmark saturation hard, highlighting the need for smarter generation strategies.
  • Open sandbox: The authors release the dataset and evaluation code, encouraging the community to experiment with new prompting, decoding, or fine‑tuning methods aimed at boosting associative creativity.
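
To make the task concrete, here is a minimal Python sketch of how a benchmark item and a model answer could be represented. The type and field names are illustrative assumptions for this summary, not the paper’s released data format.

```python
from dataclasses import dataclass

# A "path" is an ordered list of concepts from source to target,
# e.g., ["rain", "puddle", "drum", "rhythm", "music"].
Path = list[str]

@dataclass
class CreateItem:
    source: str  # first endpoint concept, e.g., "rain"
    target: str  # second endpoint concept, e.g., "music"

@dataclass
class ModelAnswer:
    item: CreateItem
    paths: list[Path]  # the model's set of candidate connecting paths
```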

Methodology

  1. Task formulation – Given a pair of concepts (e.g., “rain” and “music”), the model must output a set of paths: sequences of intermediate concepts that connect the two ends (e.g., rain → puddle → drum → rhythm → music).
  2. Answer generation – Models are prompted to produce many candidate paths, with a generous token budget to allow exhaustive exploration. Different prompting styles are tested: vanilla generation, chain‑of‑thought, and “creative” prompts that explicitly ask for novelty.
  3. Scoring
    • Specificity: Measured by how closely each intermediate term relates to its neighbors (using lexical similarity and knowledge‑graph distance).
    • Diversity: Computed as pairwise dissimilarity across all generated paths (e.g., Jaccard distance on the set of intermediate nodes).
    • The final score rewards large collections of high‑specificity, high‑diversity paths (sketched in code after this list).
  4. Benchmark construction – The authors curated thousands of concept pairs from commonsense resources and verified, against human judgments, that the quality of generated paths can be scored reliably, enabling automatic grading.
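
The scoring above can be sketched in a few lines of Python. The Jaccard‑based diversity measure follows the paper’s description; `relatedness` is a placeholder for the lexical‑similarity and knowledge‑graph proxies, and any final aggregation of the two scores would follow the authors’ released evaluation code rather than this sketch.

```python
from itertools import combinations
from typing import Callable

Path = list[str]

def specificity(path: Path, relatedness: Callable[[str, str], float]) -> float:
    """Mean relatedness of adjacent concepts along one path. `relatedness`
    is a stand-in for the paper's lexical-similarity / knowledge-graph
    proxies, not an API from the released code."""
    hops = list(zip(path, path[1:]))
    return sum(relatedness(u, v) for u, v in hops) / len(hops)

def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0 means identical intermediate-node sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def diversity(paths: list[Path]) -> float:
    """Mean pairwise Jaccard distance over intermediate nodes, matching
    the diversity criterion described above."""
    inners = [set(p[1:-1]) for p in paths]  # drop the two endpoint concepts
    pairs = list(combinations(inners, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```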

Results & Findings

| Model / Setting | Avg. CREATE Score | Observations |
| --- | --- | --- |
| GPT‑4 (baseline) | 0.62 | Strong specificity but limited diversity; many paths overlap. |
| GPT‑4 + CoT | 0.64 | Slight boost in specificity; diversity unchanged. |
| GPT‑4 + Creative Prompt | 0.68 | Notable improvement in both dimensions, but still far from human performance. |
| Larger token budget (4×) | 0.66 | More paths generated, but diminishing returns on diversity. |
| Human annotators | 0.85 | Humans naturally produce a richer variety of connections. |

Key takeaways:

  • Creative prompting helps, but the gap to human‑level associative reasoning remains sizable.
  • Simply giving the model more tokens does not guarantee better diversity; smarter search (e.g., sampling or selection strategies) is needed (a simple greedy baseline is sketched below).
  • The benchmark’s multiplicity of valid answers makes it resistant to “gaming”—models can’t just memorize a single correct path.
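
One simple illustration of “smarter search” is to oversample candidate paths and then greedily keep only those that are sufficiently dissimilar. This is an illustrative baseline, not a method from the paper; it reuses `jaccard_distance` from the scoring sketch above, and `min_dist` is an arbitrary threshold.

```python
Path = list[str]

def select_diverse(candidates: list[Path], k: int, min_dist: float = 0.5) -> list[Path]:
    """Greedily keep a candidate only if its intermediate nodes are at
    least `min_dist` Jaccard-distant from every path kept so far.
    Requires `jaccard_distance` from the scoring sketch above."""
    kept: list[Path] = []
    for path in candidates:
        inner = set(path[1:-1])
        if all(jaccard_distance(inner, set(p[1:-1])) >= min_dist for p in kept):
            kept.append(path)
        if len(kept) == k:
            break
    return kept
```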

Practical Implications

  • Idea generation tools: Developers building brainstorming assistants, product‑naming services, or hypothesis‑generation platforms can use CREATE as a validation suite to ensure their models produce genuinely novel suggestions rather than rehashing common tropes.
  • Knowledge‑graph enrichment: Automated path creation can surface hidden relations for graph construction, improving downstream tasks like recommendation or semantic search.
  • Prompt engineering: The findings highlight the importance of task‑specific prompts that explicitly ask for “unusual but plausible” connections, a useful pattern for any creative‑AI application.
  • Evaluation pipeline: CREATE’s objective scoring can be integrated into CI pipelines for LLM fine‑tuning, giving engineers a quantitative signal for “creativity” alongside traditional accuracy metrics (a minimal gate is sketched below).
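
As a sketch of the CI idea in the last bullet: a gate function could average CREATE scores over a set of benchmark items and fail the build below a threshold. Here `score_fn` is assumed to wrap the authors’ released scorer, and the 0.6 threshold is an arbitrary example, not a value from the paper.

```python
from typing import Callable, Iterable

Path = list[str]

def creativity_gate(
    outputs: Iterable[list[Path]],            # one set of paths per benchmark item
    score_fn: Callable[[list[Path]], float],  # assumed wrapper around the released scorer
    threshold: float = 0.6,                   # arbitrary example value
) -> bool:
    """Return False (i.e., fail the build) if the mean CREATE-style
    score drops below `threshold`."""
    scores = [score_fn(paths) for paths in outputs]
    return sum(scores) / len(scores) >= threshold
```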

Limitations & Future Work

  • Domain coverage: The benchmark focuses on everyday concepts; specialized domains (e.g., biomedical, legal) may require tailored path vocabularies.
  • Scoring heuristics: Specificity and diversity rely on lexical and graph‑based proxies, which may not capture deeper semantic novelty.
  • Model size bias: Larger models tend to generate more paths, but the metric does not fully normalize for capacity differences.
  • Future directions suggested by the authors include developing search‑guided decoding (e.g., Monte‑Carlo tree search), incorporating external knowledge bases to broaden the concept space, and extending the benchmark to multimodal creativity (linking images, code, or music).

Authors

  • Manya Wadhwa
  • Tiasa Singha Roy
  • Harvey Lederman
  • Junyi Jessy Li
  • Greg Durrett

Paper Information

  • arXiv ID: 2603.09970v1
  • Categories: cs.CL
  • Published: March 10, 2026