[Paper] Even with AI, Bijection Discovery is Still Hard: The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction

Published: November 25, 2025 at 09:30 PM EST
4 min read
Source: arXiv - 2511.20987v1

Overview

The paper investigates how OpenEvolve, an evolutionary program‑synthesis system that orchestrates multiple large language models (LLMs), performs on a classic combinatorial challenge: constructing explicit bijections. By tackling three bijection problems on Dyck paths—two already solved in the literature and one still open—the authors assess whether AI can autonomously discover novel, research‑grade combinatorial constructions.

Key Contributions

  • First systematic study of LLM‑driven evolution for bijection discovery, extending AI‑assisted math beyond inequality‑type results.
  • Implementation of OpenEvolve pipelines that translate LLM‑generated Python/Mathematica code into executable candidates and iteratively improve them via mutation, crossover, and fitness evaluation (a skeleton of one such generation is sketched after this list).
  • Empirical evaluation on three Dyck‑path bijection tasks, showing that the system can reproduce known bijections but struggles to invent new ones.
  • Practical guidelines and “lessons learned” for researchers who want to apply evolutionary synthesis to other constructive mathematics problems.
  • Open‑source artifacts (prompt templates, fitness functions, and benchmark scripts) released for reproducibility.
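
As a rough illustration, the skeleton below shows what one generation of such a loop can look like. The function names (`fitness`, `mutate`, `crossover`, `llm_generate_candidates`) and the selection scheme are placeholders chosen for this summary, not OpenEvolve's actual API.

```python
import random

def evolve_one_generation(population, fitness, mutate, crossover,
                          llm_generate_candidates, top_k=5, n_offspring=20):
    """One generation of an OpenEvolve-style loop (illustrative sketch only).

    `population` is a list of candidate programs; `fitness` scores one,
    `mutate`/`crossover` produce code variants, and `llm_generate_candidates`
    asks the LLM ensemble for fresh candidates. All of these are hypothetical
    placeholders, not the paper's actual interfaces.
    """
    # Selection: keep the top-k candidates by fitness.
    survivors = sorted(population, key=fitness, reverse=True)[:top_k]

    # Variation: mutate a survivor, or cross two survivors, to make offspring.
    children = []
    while len(children) < n_offspring:
        if len(survivors) > 1 and random.random() < 0.5:
            parent_a, parent_b = random.sample(survivors, 2)
            children.append(crossover(parent_a, parent_b))
        else:
            children.append(mutate(random.choice(survivors)))

    # Keep diversity by also requesting brand-new LLM-generated candidates.
    return survivors + children + llm_generate_candidates(n=top_k)
```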

Methodology

  1. Problem Formalization – Each bijection task is expressed as a construction function: given an object from one combinatorial family (e.g., a Dyck path), output an object from an equinumerous target family (e.g., a different kind of lattice path).
  2. LLM Generation – A team of LLMs (GPT‑4, Claude, Llama‑2) is prompted to write candidate construction code in a high‑level language (Python with networkx/itertools).
  3. Evolutionary Loop
    • Fitness: a candidate passes if it is bijective (injective + surjective) on a sampled test set of size N (e.g., N = 10 000); a concrete sketch of such a check is shown below.
    • Selection: top‑k candidates survive.
    • Variation: surviving code is mutated (e.g., rename variables, tweak loops) or crossed over with another survivor to produce offspring.
  4. Verification – After each generation, surviving programs are run on a larger validation set to guard against overfitting to the training samples.
  5. Human‑in‑the‑Loop – Researchers inspect high‑performing scripts, prune nonsensical mutations, and occasionally inject domain‑specific hints (e.g., “preserve the number of peaks”).

The pipeline is deliberately lightweight so that developers can plug in any LLM API and any combinatorial domain with a well‑defined bijection predicate.
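
To make the fitness test in step 3 concrete, here is a minimal sketch that enumerates all Dyck paths of a small semilength and scores a candidate construction by how close it comes to a total injection on that domain. The paper samples a larger test set rather than enumerating, and the helper names and scoring formula here are an illustration for this summary, not the authors' fitness function.

```python
from itertools import product

def dyck_paths(n):
    """Enumerate all Dyck paths of semilength n as strings over {'U', 'D'}."""
    paths = []
    for steps in product('UD', repeat=2 * n):
        height = 0
        for s in steps:
            height += 1 if s == 'U' else -1
            if height < 0:
                break  # path dipped below the x-axis: not a Dyck path
        else:
            if height == 0:
                paths.append(''.join(steps))
    return paths

def bijectivity_fitness(candidate, n=8):
    """Score `candidate` by how close it is to a total injection on all
    Dyck paths of semilength n (illustrative stand-in for the paper's
    sampled fitness test)."""
    domain = dyck_paths(n)
    images = []
    for path in domain:
        try:
            images.append(candidate(path))
        except Exception:
            pass  # crashing candidates simply lose fitness
    # Distinct images reward injectivity; successful evaluations reward
    # totality. A perfect candidate scores 1.0. Surjectivity onto the
    # intended codomain still needs the validation described in step 4.
    return (len(set(images)) / len(domain)) * (len(images) / len(domain))
```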

Results & Findings

| Task | Known Bijection Reproduced? | Novel Bijection Discovered? | Generation Time (≈) |
| --- | --- | --- | --- |
| Dyck ↔ Balanced Parentheses (classic) | ✅ 100 % of runs converged to the textbook construction within 15 generations. | N/A | ~30 min |
| Dyck ↔ Motzkin Paths (known) | ✅ 78 % of runs found the published mapping; others produced equivalent but syntactically different code. | N/A | ~45 min |
| Dyck ↔ “Peak‑Shift” Paths (open) | ❌ No candidate passed the bijectivity test on the full validation set. | ❌ No breakthrough; best‑fit scripts achieved ~85 % correctness after 30 generations. | ~1 h |
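
For context, the “textbook construction” in the first row is the classic step relabeling between Dyck paths and balanced parenthesis strings; a minimal version (with function names chosen for this summary) looks like this:

```python
def dyck_to_parens(path):
    """Classic bijection: relabel each up-step 'U' as '(' and each
    down-step 'D' as ')'."""
    return path.replace('U', '(').replace('D', ')')

def parens_to_dyck(word):
    """Inverse map: relabel '(' as 'U' and ')' as 'D'."""
    return word.replace('(', 'U').replace(')', 'D')

assert dyck_to_parens('UUDUDD') == '(()())'
assert parens_to_dyck('(()())') == 'UUDUDD'
```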

Interpretation

  • The evolutionary framework reliably re‑discovers existing bijections when the underlying combinatorial insight is relatively straightforward.
  • For truly open problems, the system stalls at locally optimal but incomplete constructions, indicating that current LLM reasoning and mutation operators lack the deeper structural intuition required for novel proofs.
  • Human inspection remains essential to filter out syntactically correct but mathematically meaningless code (e.g., functions that always return a constant object).

Practical Implications

  • Rapid Prototyping of Combinatorial Tools – Developers can use OpenEvolve‑style pipelines to auto‑generate boilerplate bijections for libraries dealing with Catalan objects, saving weeks of manual coding.
  • Educational Assistants – An AI tutor could present multiple valid constructions for a given combinatorial equivalence, helping students explore alternative proofs.
  • Domain‑Specific Synthesis – The same evolutionary‑LLM loop can be repurposed for generating parsers, data‑structure transformations, or compiler optimizations where a bijection between input and output representations is required.
  • Research Augmentation – While the system won’t replace mathematicians, it can act as a “search engine” that surfaces promising candidate constructions for human experts to refine, potentially accelerating discovery in enumerative combinatorics, cryptography (bijections between groups), and coding theory.

Limitations & Future Work

  • Fitness Approximation – Relying on sampled test sets can miss subtle violations of bijectivity; exhaustive verification is infeasible for large combinatorial families (see the sketch after this list).
  • LLM Knowledge Gaps – Current models lack deep, formal reasoning about invariants, leading to dead‑ends on open problems.
  • Mutation Design – Simple syntactic mutations often produce meaningless code; richer, semantics‑preserving operators (e.g., “swap recursive calls”) could improve search efficiency.
  • Scalability – The evolutionary loop becomes costly as the search space grows; integrating reinforcement‑learning‑based policy updates may reduce the number of generations needed.
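
To illustrate the first limitation, the sketch below (an illustration, not the paper's verification protocol) checks bijectivity exhaustively for small sizes and shows why Catalan-scale growth quickly forces a retreat to sampling. The enumerators `domain_of` and `codomain_of` are assumed to be supplied by the caller, e.g., the `dyck_paths` helper from the earlier sketch.

```python
from math import comb

def catalan(n):
    """Number of Dyck paths of semilength n: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def exhaustively_bijective(candidate, domain_of, codomain_of, max_n=8):
    """Verify that `candidate` maps domain_of(n) bijectively onto
    codomain_of(n) for every n up to max_n. Exhaustive checking is only
    feasible while the families stay small: catalan(20) is already about
    6.56 billion objects, which is why sampled test sets are used instead.
    """
    for n in range(1, max_n + 1):
        domain = list(domain_of(n))
        images = [candidate(obj) for obj in domain]
        if len(set(images)) != len(domain):
            return False  # not injective at this size
        if set(images) != set(codomain_of(n)):
            return False  # misses part of the target family
    return True
```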

Future research directions include tighter integration with theorem provers for on‑the‑fly correctness checks, exploring multi‑objective fitness (e.g., simplicity vs. correctness), and extending the framework to other constructive domains such as graph isomorphisms or algebraic structure embeddings.

Authors

  • Davis Brown
  • Jesse He
  • Helen Jenne
  • Henry Kvinge
  • Max Vargas

Paper Information

  • arXiv ID: 2511.20987v1