[Paper] Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

Published: 3 days ago (June 1, 2026 at 05:36 AM EDT)

5 min read

Source: arXiv

Source: arXiv - 2606.01975v1

Overview

The paper explores how large language models (LLMs) can be turned into “algorithmic assistants” that actually write and improve code. Using the concrete problem of finding optimal contraction orders for tensor networks—a key step in quantum simulation and machine‑learning workloads—the authors build an evolutionary coding framework called OpenEvolve that lets an LLM propose, test, and refine algorithmic ideas. Their findings show that verifier‑guided LLM agents can discover competitive heuristics, but they also underline the need for careful human oversight in evaluation and interpretation.

Key Contributions

Verifier‑guided evolutionary coding loop: Introduces a pipeline where an LLM generates candidate algorithms, a fast verifier evaluates them, and the best candidates evolve over generations.
Case study on tensor‑network contraction ordering: Demonstrates the approach on a mathematically demanding problem with real‑world relevance (e.g., quantum chemistry, tensor‑based deep learning).
Empirical comparison of LLMs: Benchmarks several LLM families (GPT‑4, Claude, Llama‑2) for code generation quality, speed, and ability to incorporate feedback.
Metric design for algorithmic quality: Proposes a composite evaluation metric that balances contraction cost, runtime, and code readability/maintainability.
Open‑source implementation (OpenEvolve): Releases the framework and a curated set of tensor‑network benchmarks for reproducibility.

Methodology

Problem Formalization: The contraction‑order problem is cast as a combinatorial optimization task: given a hypergraph representing a tensor network, find an ordering that minimizes the total floating‑point operations (FLOPs).
LLM Prompt Engineering: The authors craft prompts that ask the LLM to output Python code implementing a heuristic (e.g., greedy edge‑cut, tree‑decomposition) together with a short rationale.
Verifier Loop:
- The generated code is sandboxed and run on a small set of test networks.
- A fast verifier computes the actual contraction cost and flags runtime errors or violations of API contracts.
- A fitness score (cost + penalty) is returned to the LLM.
Evolutionary Selection: The top‑k candidates survive to the next generation; mutations are introduced by asking the LLM to “tweak” a surviving snippet (e.g., change a sorting key, add a pruning step).
Benchmark Suite: A diverse collection of synthetic and real tensor networks (e.g., from quantum chemistry, MERA, and tensor‑train models) is used for training and final evaluation.
Human‑in‑the‑Loop Validation: After several generations, the best algorithms are inspected by the authors for correctness, readability, and potential for further hand‑tuning.

Results & Findings

LLM	Best‑found heuristic (relative FLOP reduction)	Avg. verification time	Code readability score*
GPT‑4	Hybrid greedy + look‑ahead (≈ 12 % lower cost vs. baseline)	0.42 s	8.1 /10
Claude	Simple greedy with edge‑weight scaling (≈ 8 % reduction)	0.38 s	7.6 /10
Llama‑2‑13B	Randomized local search (≈ 4 % reduction)	0.55 s	6.9 /10

Performance: The GPT‑4‑driven agents consistently outperformed hand‑crafted baselines from the literature on the test suite, achieving up to a 12 % reduction in contraction cost.
Speed: The verifier loop kept the evolutionary cycle under a second per candidate, enabling thousands of iterations in a few hours.
Interpretability: Human reviewers noted that the best GPT‑4 solutions were surprisingly modular and documented, making them easier to adopt.
Robustness: When the test set was shuffled, the evolutionary process still converged to high‑quality heuristics, indicating that the approach is not over‑fitting to a single network.

Practical Implications

Accelerated algorithm prototyping: Developers can plug an LLM into an existing codebase to explore new heuristics without writing them from scratch, dramatically shortening the R&D cycle for performance‑critical kernels.
Domain‑agnostic optimization: While the study focuses on tensor networks, the verifier‑guided evolutionary loop is applicable to any combinatorial or numerical optimization problem where a fast cost estimator exists (e.g., graph partitioning, compiler pass ordering).
Tooling for quantum‑simulation stacks: Libraries such as quimb, TensorNetwork, and quantum‑chemistry packages can integrate OpenEvolve to automatically generate better contraction strategies for user‑provided circuits or Hamiltonians.
Educational aid: The framework produces readable, commented code, serving as a teaching resource for developers learning advanced tensor‑network techniques.

Limitations & Future Work

Verifier dependence: The quality of the evolved algorithms hinges on the fidelity of the cost estimator; inaccurate verification can misguide the LLM.
Scalability of the search space: For very large networks, the verifier becomes a bottleneck, and the evolutionary loop may need hierarchical or surrogate‑model extensions.
LLM access & cost: High‑performing models like GPT‑4 incur non‑trivial API costs, which may limit adoption in resource‑constrained settings.
Generalization: The study evaluated a single problem domain; future work should test the pipeline on other algorithmic challenges (e.g., sorting networks, SAT solving) and explore multi‑objective fitness functions (energy consumption, memory footprint).

*Readability score is a subjective rating by three independent developers on a 1‑10 scale.

Authors

Fabian Hoppe
Melven Röhrig-Zöllner
Philipp Knechtges

Paper Information

arXiv ID: 2606.01975v1
Categories: cs.AI, cs.SE
Published: June 1, 2026
PDF: Download PDF

[Paper] Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

[Paper] Streaming Communication in Multi-Agent Reasoning

[Paper] Reinforcement Learning from Rich Feedback with Distributional DAgger

[Paper] Multi-Column RBF Neural Network Using Adaptive and Non-Adaptive Particle Swarm Optimization