[Paper] Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery
Source: arXiv - 2602.03132v1
Overview
The paper introduces Contrastive Concept‑Tree Search (CCTS), a new way to steer large language models (LLMs) when they are used to discover algorithms. By extracting a hierarchy of “concepts” from the programs the LLM generates and learning which concepts tend to lead to high‑performing solutions, CCTS dramatically speeds up the search for effective algorithms while keeping the process interpretable.
Key Contributions
- Concept‑tree extraction: Parse generated code into a hierarchical tree of reusable algorithmic concepts (e.g., “graph traversal”, “union‑find”, “binary search”).
- Contrastive learning of concept utility: A lightweight model scores concepts by comparing their prevalence in high‑ vs. low‑performing programs, producing a log‑likelihood‑ratio “contrastive score”.
- Guided parent selection: During the iterative generate‑evaluate loop, CCTS re‑weights candidate parent programs based on contrastive scores, biasing the LLM toward promising concept combinations.
- Empirical gains on combinatorial benchmarks: Across a suite of Erdős‑type problems (graph coloring, Ramsey‑type constructions, etc.), CCTS reduces the number of LLM queries needed to reach a target performance by 30‑50 % compared with fitness‑only baselines.
- Interpretability: The learned concept trees expose which algorithmic building blocks are useful or harmful for each task, offering human‑readable insights.
- Synthetic validation: A controlled environment with a known concept space reproduces the same search dynamics, confirming that the observed improvements stem from learning to avoid “bad” concepts rather than from LLM quirks.
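To make the concept‑tree extraction idea concrete, here is a minimal sketch of a static pass over Python ASTs. The paper's extractor recognizes richer semantic motifs (e.g., union‑find, binary search) and builds a full tree with composition edges; the mapping below only tags coarse syntactic building blocks and flattens them into counts, and all names (`CONCEPT_OF`, `extract_concepts`) are illustrative, not from the paper.

```python
import ast
from collections import Counter

# Illustrative mapping from AST node types to coarse "concepts".
# A real extractor would detect semantic motifs, not just syntax.
CONCEPT_OF = {
    ast.For: "loop",
    ast.While: "loop",
    ast.FunctionDef: "function",
    ast.If: "branch",
    ast.ListComp: "comprehension",
}

def extract_concepts(source: str) -> Counter:
    """Count concept occurrences in a program's AST."""
    tree = ast.parse(source)
    counts = Counter()
    for node in ast.walk(tree):
        label = CONCEPT_OF.get(type(node))
        if label:
            counts[label] += 1
    return counts

# A toy generated program: greedy graph coloring.
program = """
def color_greedily(graph):
    colors = {}
    for v in graph:
        used = {colors[u] for u in graph[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors
"""
print(extract_concepts(program))
```

Both the `for` and `while` statements map to the same "loop" concept, so the toy program yields two "loop" occurrences and one "function".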
Methodology
1. Generate candidate programs – An LLM (e.g., GPT‑4) proposes code snippets that aim to solve a combinatorial task.
2. Parse into concepts – A static‑analysis pass extracts high‑level algorithmic motifs (loops, data structures, recursion patterns) and organizes them into a tree where each node is a concept and edges represent composition.
3. Contrastive scoring – The system maintains two concept frequency distributions: one from the top‑k performing programs and one from the bottom‑k. The contrastive score for a concept $c$ is
   $$ s(c) = \log \frac{P_{\text{high}}(c)}{P_{\text{low}}(c)}. $$
   Positive scores indicate “good” concepts; negative scores flag “bad” ones.
4. Parent re‑weighting – When the LLM asks for a parent program to mutate, CCTS multiplies the LLM’s native probability of each parent by
   $$ \exp\!\left(\sum_{c \in \text{parent}} s(c)\right). $$
   This pushes the search toward parents rich in high‑scoring concepts.
5. Iterate – The LLM mutates the selected parent, the evaluator scores the new program, and the concept model updates. The loop continues until a performance threshold is met or a query budget is exhausted.
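Steps 3 and 4 can be sketched in a few lines, assuming programs are already reduced to lists of concept labels. The function names (`contrastive_scores`, `reweight_parents`) and the add‑one smoothing constant are illustrative choices, not details from the paper; smoothing is included only so that concepts absent from one population do not produce a log of zero.

```python
import math
from collections import Counter

def contrastive_scores(high_programs, low_programs, alpha=1.0):
    """s(c) = log P_high(c) / P_low(c), with add-alpha smoothing."""
    high = Counter(c for p in high_programs for c in p)
    low = Counter(c for p in low_programs for c in p)
    vocab = set(high) | set(low)
    n_high = sum(high.values()) + alpha * len(vocab)
    n_low = sum(low.values()) + alpha * len(vocab)
    return {
        c: math.log((high[c] + alpha) / n_high)
           - math.log((low[c] + alpha) / n_low)
        for c in vocab
    }

def reweight_parents(parents, base_probs, scores):
    """Multiply each parent's base probability by exp(sum of its
    concept scores), then renormalize."""
    weights = [
        p * math.exp(sum(scores.get(c, 0.0) for c in concepts))
        for concepts, p in zip(parents, base_probs)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy populations: each program is a list of concept labels.
high = [["union-find", "loop"], ["union-find", "binary-search"]]
low = [["greedy-edge", "loop"], ["greedy-edge"]]
s = contrastive_scores(high, low)

# "union-find" appears only among high performers (positive score),
# "greedy-edge" only among low performers (negative score), so the
# re-weighting shifts mass toward the union-find parent.
probs = reweight_parents([["union-find"], ["greedy-edge"]], [0.5, 0.5], s)
```

Note how the re‑weighting leaves the LLM's own proposal distribution in place and only tilts it multiplicatively, which is what lets CCTS steer the search without retraining the model.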
Results & Findings
| Benchmark | Baseline (fitness‑only) | CCTS | Improvement |
|---|---|---|---|
| Graph coloring (n=50) | 120 LLM queries to reach 90 % success | 68 queries | ≈ 43 % fewer queries |
| Ramsey‑type construction | 95 queries | 55 queries | ≈ 42 % fewer queries |
| Hamiltonian path generation | 140 queries | 78 queries | ≈ 44 % fewer queries |
- Concept avoidance drives gains: Ablation where only high‑scoring concepts are boosted (without penalizing low‑scoring ones) yields modest speed‑ups; the biggest win comes from down‑weighting “bad” concepts.
- Interpretability: The final concept trees highlight, for example, that “greedy edge‑selection” is detrimental for Ramsey problems, while “union‑find with path compression” is consistently beneficial across graph tasks.
- Synthetic tests: In a toy domain where the ground‑truth concept utility is known, CCTS recovers the correct scores within a few iterations, mirroring the real‑LLM behavior.
Practical Implications
- Faster prototype generation: Teams using LLMs to auto‑generate algorithmic code (e.g., for competitive programming assistants, automated theorem provers, or domain‑specific optimizers) can cut the number of expensive LLM calls by nearly half.
- Reduced compute cost: Since each LLM query can cost several dollars in cloud APIs, CCTS translates directly into lower operational expenses for SaaS products that embed code‑generation loops.
- Human‑in‑the‑loop debugging: The concept tree offers a concise “debug view” of what the model is trying, letting engineers prune or inject concepts manually (e.g., force the use of a balanced BST).
- Transferable to other domains: The contrastive concept model is agnostic to the underlying LLM; it could be applied to symbolic regression, circuit synthesis, or even prompt engineering for non‑code tasks where hierarchical concepts exist.
Limitations & Future Work
- Concept extraction relies on static analysis: Complex Pythonic idioms or heavily obfuscated code may be mis‑parsed, limiting the fidelity of the concept tree.
- Benchmarks focus on combinatorial mathematics: Real‑world software engineering problems (e.g., API integration, UI generation) may involve richer, less‑structured concepts that need more sophisticated extraction pipelines.
- Scalability of contrastive scores: As the concept vocabulary grows, estimating reliable low‑performing frequencies can become noisy; smoothing or Bayesian priors may be required.
- Future directions: Extending CCTS to multi‑objective settings (e.g., balancing runtime and memory), integrating reinforcement‑learning‑style policy updates, and exploring automatic concept‑library expansion from open‑source repositories.
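The scalability concern above can be made concrete with a small numeric sketch: a plug‑in frequency estimate is undefined for a concept never seen among low performers (division by zero), whereas an add‑alpha estimate, interpretable as a Dirichlet posterior mean, stays finite and shrinks rare‑concept scores toward zero. The function name and the counts are illustrative assumptions, not figures from the paper.

```python
import math

def log_ratio(k_high, n_high, k_low, n_low, alpha=0.0, vocab=100):
    """Contrastive score from raw counts, with optional add-alpha
    (Dirichlet-prior) smoothing; alpha=0 is the plug-in estimate."""
    p_high = (k_high + alpha) / (n_high + alpha * vocab)
    p_low = (k_low + alpha) / (n_low + alpha * vocab)
    return math.log(p_high / p_low)

# Equal rare counts: the plug-in estimate is exactly zero.
raw_rare = log_ratio(1, 1000, 1, 1000)

# Never observed among low performers: the plug-in estimate would
# divide by zero, but the smoothed score is finite and modest.
smoothed = log_ratio(5, 1000, 0, 1000, alpha=1.0)

# Smoothing also shrinks noisy scores: log(6/2) < log(5/1).
shrunk = log_ratio(5, 1000, 1, 1000, alpha=1.0)
plug_in = log_ratio(5, 1000, 1, 1000, alpha=0.0)
```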
Authors
- Timothee Leleu
- Sudeera Gunathilaka
- Federico Ghimenti
- Surya Ganguli
Paper Information
- arXiv ID: 2602.03132v1
- Categories: cs.LG, cs.AI, cs.NE
- Published: February 3, 2026