[Paper] Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery
Source: arXiv - 2602.03132v1
Overview
The paper introduces Contrastive Concept‑Tree Search (CCTS), a new way to steer large language models (LLMs) when they are used to discover algorithms. By extracting a hierarchy of “concepts” from the programs the LLM generates and learning which concepts tend to lead to high‑performing solutions, CCTS dramatically speeds up the search for effective algorithms while keeping the process interpretable.
Key Contributions
- Concept‑tree extraction: Parse generated code into a hierarchical tree of reusable algorithmic concepts (e.g., “graph traversal”, “union‑find”, “binary search”).
- Contrastive learning of concept utility: A lightweight model scores concepts by comparing their prevalence in high‑ vs. low‑performing programs, producing a log‑likelihood‑ratio “contrastive score”.
- Guided parent selection: During the iterative generate‑evaluate loop, CCTS re‑weights candidate parent programs based on contrastive scores, biasing the LLM toward promising concept combinations.
- Empirical gains on combinatorial benchmarks: Across a suite of Erdős‑type problems (graph coloring, Ramsey‑type constructions, etc.), CCTS reduces the number of LLM queries needed to reach a target performance by 30‑50 % compared with fitness‑only baselines.
- Interpretability: The learned concept trees expose which algorithmic building blocks are useful or harmful for each task, offering human‑readable insights.
- Synthetic validation: A controlled environment with a known concept space reproduces the same search dynamics, confirming that the observed improvements stem from learning to avoid “bad” concepts rather than from LLM quirks.
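To make the concept‑tree extraction idea concrete, here is a minimal sketch of a static pass over Python ASTs. The paper's extractor recognizes richer semantic motifs (e.g., union‑find, binary search) and builds a full tree with composition edges; the mapping below only tags coarse syntactic building blocks and flattens them into counts, and all names (`CONCEPT_OF`, `extract_concepts`) are illustrative, not from the paper.

```python
import ast
from collections import Counter

# Illustrative mapping from AST node types to coarse "concepts".
# A real extractor would detect semantic motifs, not just syntax.
CONCEPT_OF = {
    ast.For: "loop",
    ast.While: "loop",
    ast.FunctionDef: "function",
    ast.If: "branch",
    ast.ListComp: "comprehension",
}

def extract_concepts(source: str) -> Counter:
    """Count concept occurrences in a program's AST."""
    tree = ast.parse(source)
    counts = Counter()
    for node in ast.walk(tree):
        label = CONCEPT_OF.get(type(node))
        if label:
            counts[label] += 1
    return counts

# A toy generated program: greedy graph coloring.
program = """
def color_greedily(graph):
    colors = {}
    for v in graph:
        used = {colors[u] for u in graph[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors
"""
print(extract_concepts(program))
```

Both the `for` and `while` statements map to the same "loop" concept, so the toy program yields two "loop" occurrences and one "function".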
Methodology
1. Generate candidate programs – An LLM (e.g., GPT‑4) proposes code snippets that aim to solve a combinatorial task.
2. Parse into concepts – A static‑analysis pass extracts high‑level algorithmic motifs (loops, data structures, recursion patterns) and organizes them into a tree where each node is a concept and edges represent composition.
3. Contrastive scoring – The system maintains two concept frequency distributions: one from the top‑k performing programs and one from the bottom‑k. The contrastive score for a concept $c$ is
   $$ s(c) = \log \frac{P_{\text{high}}(c)}{P_{\text{low}}(c)}. $$
   Positive scores indicate “good” concepts; negative scores flag “bad” ones.
4. Parent re‑weighting – When the LLM asks for a parent program to mutate, CCTS multiplies the LLM’s native probability of each parent by
   $$ \exp\!\left(\sum_{c \in \text{parent}} s(c)\right). $$
   This pushes the search toward parents rich in high‑scoring concepts.
5. Iterate – The LLM mutates the selected parent, the evaluator scores the new program, and the concept model updates. The loop continues until a performance threshold is met or a query budget is exhausted.
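Steps 3 and 4 can be sketched in a few lines, assuming programs are already reduced to lists of concept labels. The function names (`contrastive_scores`, `reweight_parents`) and the add‑one smoothing constant are illustrative choices, not details from the paper; smoothing is included only so that concepts absent from one population do not produce a log of zero.

```python
import math
from collections import Counter

def contrastive_scores(high_programs, low_programs, alpha=1.0):
    """s(c) = log P_high(c) / P_low(c), with add-alpha smoothing."""
    high = Counter(c for p in high_programs for c in p)
    low = Counter(c for p in low_programs for c in p)
    vocab = set(high) | set(low)
    n_high = sum(high.values()) + alpha * len(vocab)
    n_low = sum(low.values()) + alpha * len(vocab)
    return {
        c: math.log((high[c] + alpha) / n_high)
           - math.log((low[c] + alpha) / n_low)
        for c in vocab
    }

def reweight_parents(parents, base_probs, scores):
    """Multiply each parent's base probability by exp(sum of its
    concept scores), then renormalize."""
    weights = [
        p * math.exp(sum(scores.get(c, 0.0) for c in concepts))
        for concepts, p in zip(parents, base_probs)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy populations: each program is a list of concept labels.
high = [["union-find", "loop"], ["union-find", "binary-search"]]
low = [["greedy-edge", "loop"], ["greedy-edge"]]
s = contrastive_scores(high, low)

# "union-find" appears only among high performers (positive score),
# "greedy-edge" only among low performers (negative score), so the
# re-weighting shifts mass toward the union-find parent.
probs = reweight_parents([["union-find"], ["greedy-edge"]], [0.5, 0.5], s)
```

Note how the re‑weighting leaves the LLM's own proposal distribution in place and only tilts it multiplicatively, which is what lets CCTS steer the search without retraining the model.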
Results & Findings
| Benchmark | Baseline (fitness‑only) | CCTS | Improvement |
|---|---|---|---|
| Graph coloring (n=50) | 120 LLM queries to reach 90 % success | 68 queries | ≈ 43 % fewer queries |
| Ramsey‑type construction | 95 queries | 55 queries | ≈ 42 % fewer queries |
| Hamiltonian path generation | 140 queries | 78 queries | ≈ 44 % fewer queries |
- Concept avoidance drives gains: Ablation where only high‑scoring concepts are boosted (without penalizing low‑scoring ones) yields modest speed‑ups; the biggest win comes from down‑weighting “bad” concepts.
- Interpretability: The final concept trees highlight, for example, that “greedy edge‑selection” is detrimental for Ramsey problems, while “union‑find with path compression” is consistently beneficial across graph tasks.
- Synthetic tests: In a toy domain where the ground‑truth concept utility is known, CCTS recovers the correct scores within a few iterations, mirroring the real‑LLM behavior.
Practical Implications
- Faster prototype generation: Teams using LLMs to auto‑generate algorithmic code (e.g., for competitive programming assistants, automated theorem provers, or domain‑specific optimizers) can cut the number of expensive LLM calls by nearly half.
- Reduced compute cost: Since each LLM query can cost several dollars in cloud APIs, CCTS translates directly into lower operational expenses for SaaS products that embed code‑generation loops.
- Human‑in‑the‑loop debugging: The concept tree offers a concise “debug view” of what the model is trying, letting engineers prune or inject concepts manually (e.g., force the use of a balanced BST).
- Transferable to other domains: The contrastive concept model is agnostic to the underlying LLM; it could be applied to symbolic regression, circuit synthesis, or even prompt engineering for non‑code tasks where hierarchical concepts exist.
Limitations & Future Work
- Concept extraction relies on static analysis: Complex Pythonic idioms or heavily obfuscated code may be mis‑parsed, limiting the fidelity of the concept tree.
- Benchmarks focus on combinatorial mathematics: Real‑world software engineering problems (e.g., API integration, UI generation) may involve richer, less‑structured concepts that need more sophisticated extraction pipelines.
- Scalability of contrastive scores: As the concept vocabulary grows, estimating reliable low‑performing frequencies can become noisy; smoothing or Bayesian priors may be required.
- Future directions: Extending CCTS to multi‑objective settings (e.g., balancing runtime and memory), integrating reinforcement‑learning‑style policy updates, and exploring automatic concept‑library expansion from open‑source repositories.
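The scalability concern above can be made concrete with a small numeric sketch: a plug‑in frequency estimate is undefined for a concept never seen among low performers (division by zero), whereas an add‑alpha estimate, interpretable as a Dirichlet posterior mean, stays finite and shrinks rare‑concept scores toward zero. The function name and the counts are illustrative assumptions, not figures from the paper.

```python
import math

def log_ratio(k_high, n_high, k_low, n_low, alpha=0.0, vocab=100):
    """Contrastive score from raw counts, with optional add-alpha
    (Dirichlet-prior) smoothing; alpha=0 is the plug-in estimate."""
    p_high = (k_high + alpha) / (n_high + alpha * vocab)
    p_low = (k_low + alpha) / (n_low + alpha * vocab)
    return math.log(p_high / p_low)

# Equal rare counts: the plug-in estimate is exactly zero.
raw_rare = log_ratio(1, 1000, 1, 1000)

# Never observed among low performers: the plug-in estimate would
# divide by zero, but the smoothed score is finite and modest.
smoothed = log_ratio(5, 1000, 0, 1000, alpha=1.0)

# Smoothing also shrinks noisy scores: log(6/2) < log(5/1).
shrunk = log_ratio(5, 1000, 1, 1000, alpha=1.0)
plug_in = log_ratio(5, 1000, 1, 1000, alpha=0.0)
```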
Authors
- Timothee Leleu
- Sudeera Gunathilaka
- Federico Ghimenti
- Surya Ganguli
Paper Information
- arXiv ID: 2602.03132v1
- Categories: cs.LG, cs.AI, cs.NE
- Published: February 3, 2026