[Paper] Self-Improving Language Models with Bidirectional Evolutionary Search

Published: May 27, 2026 at 01:59 PM EDT

Source: arXiv - 2605.28814v1

Overview

The paper introduces Bidirectional Evolutionary Search (BES), a framework that lets language models improve themselves both during post‑training and at inference time. BES combines a forward “evolution” of candidate texts with a backward decomposition of the original task into checkable sub‑goals. The combination targets two weaknesses of classic best‑of‑N and tree‑search methods: sparse feedback and limited exploration.

Key Contributions

Bidirectional search paradigm: couples a forward evolutionary generation process with a backward goal‑decomposition routine, providing dense intermediate supervision.
Evolutionary operators for text: novel recombination and mutation mechanisms that merge partial trajectories, enabling the model to explore low‑probability but high‑utility regions of the output space.
Theoretical analysis: proves that pure autoregressive expansion stays within a narrow “entropy shell,” while evolutionary steps can escape it; backward decomposition can reduce the sample complexity exponentially.
Empirical validation: demonstrates consistent gains on difficult post‑training tasks where existing self‑improvement methods stall, and sets new state‑of‑the‑art results on three open‑ended problem‑solving benchmarks at inference time.
Open‑source release: provides code, pretrained checkpoints, and a ready‑to‑use library for the community.

Methodology

1. Forward Evolutionary Search

Starts from a set of seed completions generated by the base language model.
Applies mutation (small perturbations to token sequences) and crossover (splicing together fragments from two different candidates) to create new hybrid completions.
These operators are analogous to those used in genetic algorithms but are tailored to discrete text, allowing the search to jump to regions that a single autoregressive rollout is very unlikely to reach.
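The two operators above can be illustrated with a minimal sketch. This summary does not specify the paper's actual operators, so the version below is an assumption: candidates are plain token lists, mutation is random token replacement, and crossover is a single‑point splice. The names `mutate`, `crossover`, `vocab`, and `rate` are hypothetical.

```python
import random

def mutate(tokens, vocab, rate=0.1, rng=random):
    """Replace each token with a random vocabulary token with probability `rate`.

    Assumption: a stand-in for the paper's mutation operator, which may differ.
    """
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]

def crossover(parent_a, parent_b, rng=random):
    """Single-point splice: a prefix of parent_a followed by a suffix of parent_b.

    Assumption: a stand-in for the paper's recombination of partial trajectories.
    """
    if not parent_a or not parent_b:
        return list(parent_a or parent_b)
    cut_a = rng.randint(0, len(parent_a))  # inclusive bounds
    cut_b = rng.randint(0, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:]

# Usage on two hypothetical seed completions split on whitespace.
rng = random.Random(0)
a = "first compute the sum then divide by two".split()
b = "multiply the values and subtract the offset".split()
child = crossover(a, b, rng)
mutant = mutate(child, vocab=a + b, rate=0.2, rng=rng)
```

A child produced this way can combine fragments that no single left‑to‑right sample contains, which is the intuition behind the exploration claim, although a token‑level splice like this one will often be ungrammatical without further filtering.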

2. Backward Goal Decomposition

The original task (e.g., “solve this math puzzle”) is recursively broken down into smaller, verifiable sub‑goals (e.g., “compute X”, “check Y”).
Each sub‑goal yields a dense verification signal (pass/fail, numeric error, etc.) that can be evaluated cheaply.
The feedback is fed back to the forward search, biasing mutation/crossover toward candidates that satisfy more sub‑goals.
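As a rough illustration of the dense signal described above, the sketch below scores a candidate by the fraction of sub‑goals it passes. The `SubGoal` type, the `subgoal_score` function, and the two example checks are hypothetical; how the paper generates and verifies sub‑goals is not detailed in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SubGoal:
    """A named, cheaply checkable condition on a candidate text (hypothetical type)."""
    name: str
    check: Callable[[str], bool]

def subgoal_score(candidate: str, goals: List[SubGoal]) -> float:
    """Return the fraction of sub-goals the candidate satisfies (0.0 if there are none)."""
    if not goals:
        return 0.0
    passed = sum(1 for g in goals if g.check(candidate))
    return passed / len(goals)

# Hypothetical decomposition of the task "what is 3 * 4?".
goals = [
    SubGoal("mentions the product", lambda s: "12" in s),
    SubGoal("has a final answer line", lambda s: "Answer:" in s),
]

full = subgoal_score("3 * 4 = 12. Answer: 12", goals)
partial = subgoal_score("3 * 4 = 12", goals)
```

Compared with a single pass/fail on the final answer, a fractional score of this kind lets the search rank partially correct candidates, which is what allows it to bias mutation and crossover toward them.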

3. Iterative Loop

The forward and backward components run in tandem: the backward module proposes a hierarchy of sub‑goals, the forward module explores candidate solutions, and the verification scores prune or promote candidates.
The loop continues until a stopping criterion (budget of generations, convergence of scores, or a hard deadline) is met.
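The loop above can be sketched on a toy problem. Everything here is an assumption made for illustration: bit lists stand in for text candidates, `toy_score` stands in for sub‑goal verification (one sub‑goal per position), and `evolve`, `toy_vary`, `keep`, and `children` are hypothetical names. Only two of the stopping criteria mentioned above are shown: a generation budget and a perfect score.

```python
import random

def evolve(seeds, score, vary, generations=20, keep=4, children=12, rng=None):
    """Select the top `keep` candidates each round, then refill with varied offspring."""
    if rng is None:
        rng = random.Random(0)
    population = list(seeds)
    history = []  # best score seen at the start of each generation
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[:keep]          # promote high-scoring candidates
        history.append(score(elite[0]))
        if history[-1] >= 1.0:             # every sub-goal satisfied
            break
        offspring = [vary(rng.choice(elite), rng.choice(elite), rng)
                     for _ in range(children)]
        population = elite + offspring     # low scorers are pruned
    return max(population, key=score), history

TARGET = [1, 0, 1, 1, 0, 1, 0, 0]

def toy_score(bits):
    """Fraction of positions matching TARGET; each position acts as one sub-goal."""
    return sum(b == t for b, t in zip(bits, TARGET)) / len(TARGET)

def toy_vary(a, b, rng):
    """Single-point crossover followed by an occasional one-bit mutation."""
    cut = rng.randint(0, len(a))
    child = a[:cut] + b[cut:]
    if rng.random() < 0.5:
        i = rng.randrange(len(child))
        child[i] = 1 - child[i]
    return child

rng = random.Random(0)
seeds = [[rng.randint(0, 1) for _ in TARGET] for _ in range(6)]
best, history = evolve(seeds, toy_score, toy_vary, rng=rng)
```

Because the elite candidates are carried over unchanged, the best score in `history` never decreases from one generation to the next. In a real system the scoring and variation steps would each involve model calls, so the budget would be counted in generations of LLM output rather than loop iterations.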

4. Training‑Free Self‑Improvement

At inference time, BES does not require gradient updates; it operates directly on a frozen language model, so it can be applied to off‑the‑shelf LLMs.

Results & Findings

Setting	Baseline	BES (Avg.)	BES (Best)
Post‑training text refinement (synthetic QA)	No improvement over base LM	+7.3 % exact match	+12.1 % exact match
Open‑ended reasoning (HotpotQA‑style)	42.5 % EM	48.9 % EM	55.2 % EM
Code generation (HumanEval)	21.4 % pass@1	27.6 % pass@1	33.1 % pass@1

Escape from entropy shell: Evolutionary recombination produced candidates with log‑probabilities up to 3× lower than any pure autoregressive rollout, yet they achieved higher task success.
Sample efficiency: Backward decomposition reduced the number of forward generations needed to hit a correct answer by roughly an order of magnitude compared with best‑of‑N sampling.
Robustness: BES maintained gains across model sizes (7B‑30B) and domains (math, commonsense, code), indicating that the approach is not tied to a specific architecture.
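For context on the best‑of‑N baseline in the sample‑efficiency finding, the sketch below works through the standard arithmetic for independent sampling. It is not the paper's analysis. It assumes each sample succeeds independently with probability `p`, so N samples contain at least one success with probability 1 - (1 - p)^N. The values of `p` and `target` are hypothetical.

```python
import math

def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p) ** n

def samples_needed(p: float, target: float) -> int:
    """Smallest n such that best-of-n succeeds with probability >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

# Hypothetical per-sample success rates for a hard and an easier task.
n_hard = samples_needed(0.01, 0.95)
n_easy = samples_needed(0.10, 0.95)
```

Under this model the required N grows roughly in proportion to 1/p when p is small, which is why raising the effective per‑candidate success rate through intermediate feedback can cut the number of generations substantially.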

Practical Implications

Plug‑and‑play improvement: Developers can wrap BES around an existing LLM (OpenAI, Anthropic, LLaMA, etc.) without retraining, which can improve performance on complex prompts.
Cost‑effective inference: Because BES relies on cheap verification (e.g., unit tests for code, constraint checks for math) rather than large beam widths, it can achieve higher quality answers with comparable or lower compute budgets.
Better autonomous agents: For agents that need to plan and self‑debug (e.g., robotic instruction generation, data‑pipeline synthesis), the backward decomposition supplies a natural “self‑check” loop, reducing hallucinations.
Open‑source ecosystem: The released library integrates with popular frameworks (Transformers, LangChain), making it straightforward to add evolutionary search to existing pipelines.
Potential for safety: Dense verification signals can incorporate policy checks (toxicity, privacy), allowing BES to filter out unsafe generations early in the search.
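The cheap verification mentioned above, such as unit tests for code, can be sketched as a scoring function over candidate programs. This is a minimal illustration and not the released library's API. The name `unit_test_score` and the test cases are hypothetical, and running untrusted model output with `exec` is unsafe outside a sandbox.

```python
def unit_test_score(source, func_name, cases):
    """Return the fraction of (args, expected) cases a candidate function passes.

    Assumption: candidates are Python source strings defining `func_name`.
    WARNING: exec runs arbitrary code; use a sandbox for real model output.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # does not compile, or does not define the function
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this case counts as a failure
    return passed / len(cases)

cases = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"

good_score = unit_test_score(good, "add", cases)
bad_score = unit_test_score(bad, "add", cases)
```

A policy check (for example, a toxicity or privacy filter) could be expressed the same way, as one more function that maps a candidate to a pass or fail result and feeds into the candidate's score.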

Limitations & Future Work

Verification dependency: BES’s gains hinge on having reliable, automatically checkable sub‑goals; tasks lacking clear constraints may see limited benefit.
Search overhead: While more sample‑efficient than brute‑force, the evolutionary loop introduces latency (multiple generations, recombination steps) that may be unsuitable for ultra‑low‑latency applications.
Scalability of crossover: Designing effective recombination operators for very long texts (e.g., multi‑page documents) remains an open challenge.
Theoretical bounds: The current analysis assumes ideal sub‑goal decomposition; extending proofs to noisy or approximate checks is future work.
Human‑in‑the‑loop extensions: Exploring how minimal human feedback could guide the backward decomposition could further improve performance on ambiguous tasks.

Bidirectional Evolutionary Search shows that combining classic evolutionary ideas with modern language models can enable richer exploration and more effective self‑verification, offering a practical boost for developers building more reliable AI systems.

Authors

Guowei Xu
Zhenting Qi
Huangyuan Su
Weirui Ye
Himabindu Lakkaraju
Sham M. Kakade
Yilun Du

Paper Information

arXiv ID: 2605.28814v1
Categories: cs.CL
Published: May 27, 2026