[Paper] Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks
Source: arXiv - 2606.01975v1
Overview
The paper explores how large language models (LLMs) can be turned into “algorithmic assistants” that actually write and improve code. Using the concrete problem of finding optimal contraction orders for tensor networks—a key step in quantum simulation and machine‑learning workloads—the authors build an evolutionary coding framework called OpenEvolve that lets an LLM propose, test, and refine algorithmic ideas. Their findings show that verifier‑guided LLM agents can discover competitive heuristics, but they also underline the need for careful human oversight in evaluation and interpretation.
Key Contributions
- Verifier‑guided evolutionary coding loop: Introduces a pipeline where an LLM generates candidate algorithms, a fast verifier evaluates them, and the best candidates evolve over generations.
- Case study on tensor‑network contraction ordering: Demonstrates the approach on a mathematically demanding problem with real‑world relevance (e.g., quantum chemistry, tensor‑based deep learning).
- Empirical comparison of LLMs: Benchmarks several LLM families (GPT‑4, Claude, Llama‑2) for code generation quality, speed, and ability to incorporate feedback.
- Metric design for algorithmic quality: Proposes a composite evaluation metric that balances contraction cost, runtime, and code readability/maintainability.
- Open‑source implementation (OpenEvolve): Releases the framework and a curated set of tensor‑network benchmarks for reproducibility.
Methodology
- Problem Formalization: The contraction‑order problem is cast as a combinatorial optimization task: given a hypergraph representing a tensor network, find an ordering that minimizes the total floating‑point operations (FLOPs).
- LLM Prompt Engineering: The authors craft prompts that ask the LLM to output Python code implementing a heuristic (e.g., greedy edge‑cut, tree‑decomposition) together with a short rationale.
- Verifier Loop:
- The generated code is sandboxed and run on a small set of test networks.
- A fast verifier computes the actual contraction cost and flags runtime errors or violations of API contracts.
- A fitness score (cost + penalty) is returned to the LLM.
- Evolutionary Selection: The top‑k candidates survive to the next generation; mutations are introduced by asking the LLM to “tweak” a surviving snippet (e.g., change a sorting key, add a pruning step).
- Benchmark Suite: A diverse collection of synthetic and real tensor networks (e.g., from quantum chemistry, MERA, and tensor‑train models) is used for training and final evaluation.
- Human‑in‑the‑Loop Validation: After several generations, the best algorithms are inspected by the authors for correctness, readability, and potential for further hand‑tuning.
Results & Findings
| LLM | Best‑found heuristic (relative FLOP reduction) | Avg. verification time | Code readability score* |
|---|---|---|---|
| GPT‑4 | Hybrid greedy + look‑ahead (≈ 12 % lower cost vs. baseline) | 0.42 s | 8.1 /10 |
| Claude | Simple greedy with edge‑weight scaling (≈ 8 % reduction) | 0.38 s | 7.6 /10 |
| Llama‑2‑13B | Randomized local search (≈ 4 % reduction) | 0.55 s | 6.9 /10 |
- Performance: The GPT‑4‑driven agents consistently outperformed hand‑crafted baselines from the literature on the test suite, achieving up to a 12 % reduction in contraction cost.
- Speed: The verifier loop kept the evolutionary cycle under a second per candidate, enabling thousands of iterations in a few hours.
- Interpretability: Human reviewers noted that the best GPT‑4 solutions were surprisingly modular and documented, making them easier to adopt.
- Robustness: When the test set was shuffled, the evolutionary process still converged to high‑quality heuristics, indicating that the approach is not over‑fitting to a single network.
Practical Implications
- Accelerated algorithm prototyping: Developers can plug an LLM into an existing codebase to explore new heuristics without writing them from scratch, dramatically shortening the R&D cycle for performance‑critical kernels.
- Domain‑agnostic optimization: While the study focuses on tensor networks, the verifier‑guided evolutionary loop is applicable to any combinatorial or numerical optimization problem where a fast cost estimator exists (e.g., graph partitioning, compiler pass ordering).
- Tooling for quantum‑simulation stacks: Libraries such as
quimb,TensorNetwork, and quantum‑chemistry packages can integrate OpenEvolve to automatically generate better contraction strategies for user‑provided circuits or Hamiltonians. - Educational aid: The framework produces readable, commented code, serving as a teaching resource for developers learning advanced tensor‑network techniques.
Limitations & Future Work
- Verifier dependence: The quality of the evolved algorithms hinges on the fidelity of the cost estimator; inaccurate verification can misguide the LLM.
- Scalability of the search space: For very large networks, the verifier becomes a bottleneck, and the evolutionary loop may need hierarchical or surrogate‑model extensions.
- LLM access & cost: High‑performing models like GPT‑4 incur non‑trivial API costs, which may limit adoption in resource‑constrained settings.
- Generalization: The study evaluated a single problem domain; future work should test the pipeline on other algorithmic challenges (e.g., sorting networks, SAT solving) and explore multi‑objective fitness functions (energy consumption, memory footprint).
*Readability score is a subjective rating by three independent developers on a 1‑10 scale.
Authors
- Fabian Hoppe
- Melven Röhrig-Zöllner
- Philipp Knechtges
Paper Information
- arXiv ID: 2606.01975v1
- Categories: cs.AI, cs.SE
- Published: June 1, 2026
- PDF: Download PDF