[Paper] EvoGM: Learning to Merge LLMs via Evolutionary Generative Optimization

Published: May 27, 2026 at 11:22 PM EDT

Source: arXiv - 2605.29295v1

Overview

The paper introduces EvoGM, a new way to combine large language models (LLMs) without any additional training. By treating model merging as an evolutionary search problem and using a learnable generative model to propose merging coefficients, EvoGM finds high‑performing mixtures far more efficiently than prior hand‑crafted heuristics. The result is a practical, training‑free recipe for building stronger, task‑adapted LLMs on the fly.

Key Contributions

Learnable coefficient generation – Replaces random mutation/crossover operators with a dual‑generator network that learns the distribution of promising merging weights.
Cycle‑consistent training – Enforces that generated coefficients can be reconstructed from the merged model’s performance, improving sample quality.
Winner‑loser pair mining – Leverages historical search trajectories to model the “elite” region of the coefficient space, boosting data efficiency.
Multi‑round evolutionary pipeline – Iteratively treats the best merged models as new experts, enabling progressive refinement without any gradient‑based fine‑tuning.
State‑of‑the‑art results – Demonstrates consistent gains over existing evolutionary merging baselines across a suite of zero‑shot and few‑shot benchmarks, including both seen and unseen tasks.

Methodology

Problem framing – Given a set of pre‑trained LLMs (the “experts”) with weights \(\theta_i\), the goal is to find scalar coefficients \(w_i\) that linearly combine those weights:
\[ \theta_{\text{merged}} = \sum_i w_i \, \theta_i \]
The search space is the simplex of coefficients (they sum to 1 and are non‑negative).
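As a rough illustration of the merge formula and the simplex constraint, the sketch below combines toy "experts" stored as plain dictionaries of float lists. The names `softmax` and `merge_experts`, the parameter name `layer.weight`, and the toy values are all hypothetical; a softmax over free logits is only one possible way to stay on the simplex, and the summary does not say which parameterization the paper uses.

```python
import math

def softmax(logits):
    """Map unconstrained real numbers onto the simplex (non-negative, sums to 1)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def merge_experts(experts, coeffs):
    """Linearly combine expert parameters: theta_merged = sum_i w_i * theta_i.

    experts: list of dicts mapping parameter name -> list of floats
             (a toy stand-in for real tensors; all experts share shapes).
    coeffs:  non-negative weights that sum to 1.
    """
    if len(experts) != len(coeffs):
        raise ValueError("need one coefficient per expert")
    if any(w < 0 for w in coeffs) or abs(sum(coeffs) - 1.0) > 1e-6:
        raise ValueError("coefficients must lie on the simplex")
    merged = {}
    for name in experts[0]:
        # Walk the experts' values for this parameter position by position.
        columns = zip(*(e[name] for e in experts))
        merged[name] = [sum(w * v for w, v in zip(coeffs, col)) for col in columns]
    return merged

# Two toy experts with a single two-element parameter each (hypothetical values).
expert_a = {"layer.weight": [1.0, 0.0]}
expert_b = {"layer.weight": [0.0, 1.0]}
w = softmax([0.0, 0.0])  # equal logits give equal weights
merged = merge_experts([expert_a, expert_b], w)
```

With equal weights the merged parameter is the element-wise average of the two experts. Real checkpoints would use tensors rather than lists, but the arithmetic is the same.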
Dual‑generator architecture
- Generator G₁ proposes candidate coefficient vectors from a latent noise vector.
- Generator G₂ acts as a reverse model, trying to reconstruct the latent code from the performance feedback of the merged model.
- A cycle‑consistency loss forces G₂(G₁(z)) ≈ z, ensuring that generated coefficients lie in a region that the evaluator can meaningfully assess.
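The cycle constraint can be sketched as a reconstruction error between a latent vector and its round trip through both generators. Everything in this block is an assumption made for illustration: the generators are single linear maps, G₂ takes the coefficients directly (the summary describes it as working from performance feedback), and the loss is a mean squared error. The paper's actual networks and loss may differ.

```python
import math
import random

def softmax(logits):
    """Map unconstrained logits onto the simplex."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def g1(z, weights):
    """Hypothetical forward generator: latent vector -> simplex coefficients."""
    logits = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
    return softmax(logits)

def g2(coeffs, weights):
    """Hypothetical reverse generator: coefficients -> reconstructed latent."""
    return [sum(w * c for w, c in zip(row, coeffs)) for row in weights]

def cycle_loss(z, w1, w2):
    """Mean squared error between z and its reconstruction G2(G1(z))."""
    z_hat = g2(g1(z, w1), w2)
    return sum((a - b) ** 2 for a, b in zip(z, z_hat)) / len(z)

rng = random.Random(0)
latent_dim, n_experts = 2, 3
# Randomly initialised, untrained generator weights (illustrative only).
w1 = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(n_experts)]
w2 = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(latent_dim)]
z = [rng.gauss(0, 1) for _ in range(latent_dim)]

coeffs = g1(z, w1)           # one coefficient per expert, on the simplex
loss = cycle_loss(z, w1, w2)  # the quantity training would push toward zero
```

Training would adjust `w1` and `w2` to reduce `loss`; that step is omitted here to keep the sketch short.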
Evolutionary loop
- Population initialization: Random coefficients are sampled and evaluated on a validation set.
- Selection: The top‑k “winners” are paired with lower‑performing “losers” to form training pairs for the generators.
- Generation: G₁ samples new candidates; G₂ refines them using the cycle loss.
- Evaluation: Each new merged model is scored (e.g., accuracy, perplexity) and the elite set is fed back as the next generation’s experts.
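The four steps above can be sketched as a short loop. To keep the example self-contained, the learned generators are swapped for a much simpler proposal: a per-dimension Gaussian over logits that is refit to the winners each generation (essentially a cross-entropy-method update, not the paper's dual-generator model). The `evaluate` function and its `target` are a toy stand-in for scoring a merged model on a validation set.

```python
import math
import random

def softmax(logits):
    """Map unconstrained logits onto the simplex (non-negative, sums to 1)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def evaluate(coeffs, target=(0.6, 0.3, 0.1)):
    """Toy stand-in for validation scoring; higher is better."""
    return -sum((c - t) ** 2 for c, t in zip(coeffs, target))

def evolve(n_experts=3, pop_size=30, top_k=6, generations=15, seed=0):
    rng = random.Random(seed)
    mean, std = [0.0] * n_experts, [1.0] * n_experts
    best_score, best_coeffs, history = -math.inf, None, []
    for _ in range(generations):
        # Generation: sample candidate logits from the current proposal.
        population = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop_size)]
        # Evaluation: rank candidates by the score of their simplex coefficients.
        ranked = sorted(population, key=lambda l: evaluate(softmax(l)), reverse=True)
        # Selection: keep the top-k winners.
        winners = ranked[:top_k]
        top = softmax(winners[0])
        if evaluate(top) > best_score:
            best_score, best_coeffs = evaluate(top), top
        history.append(best_score)
        # Proposal update: refit mean and spread to the winners.
        for j in range(n_experts):
            column = [w[j] for w in winners]
            mean[j] = sum(column) / top_k
            var = sum((x - mean[j]) ** 2 for x in column) / top_k
            std[j] = max(math.sqrt(var), 0.05)  # floor keeps some exploration
    return best_coeffs, best_score, history

best_coeffs, best_score, history = evolve()
```

`history` records the best score seen so far after each generation, so it never decreases. In a real run, each call to `evaluate` would require merging the experts and running the merged model on validation data, which is where nearly all of the cost lies.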
Data efficiency tricks
- Winner‑loser pairs approximate the direction of improvement across the coefficient space without computing explicit gradients.
- Replay buffer stores past high‑quality coefficients to prevent forgetting.
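A minimal sketch of both tricks, under stated assumptions: `mine_pairs` and `ReplayBuffer` are hypothetical names, the pairing rule (every top-k winner against every remaining candidate) is one plausible reading of the description, and the buffer simply keeps the highest-scoring entries. The scores in `history` are made-up toy values.

```python
from itertools import product

def mine_pairs(history, top_k=2, margin=0.0):
    """Pair each top-k 'winner' with every lower-ranked 'loser'.

    history: list of (coeffs, score) tuples from past evaluations.
    Returns (winner_coeffs, loser_coeffs) pairs whose score gap exceeds margin.
    """
    ranked = sorted(history, key=lambda item: item[1], reverse=True)
    winners, losers = ranked[:top_k], ranked[top_k:]
    return [(w[0], l[0]) for w, l in product(winners, losers)
            if w[1] - l[1] > margin]

class ReplayBuffer:
    """Keep only the best `capacity` coefficient vectors seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (score, coeffs), best first

    def add(self, coeffs, score):
        self.items.append((score, coeffs))
        self.items.sort(key=lambda item: item[0], reverse=True)
        del self.items[self.capacity:]  # drop everything beyond capacity

    def best(self):
        return [coeffs for _, coeffs in self.items]

# Toy search history: (coefficients, validation score), values are illustrative.
history = [
    ([0.5, 0.5], 0.70),
    ([0.8, 0.2], 0.82),
    ([0.2, 0.8], 0.61),
    ([0.6, 0.4], 0.79),
]
pairs = mine_pairs(history, top_k=2)

buffer = ReplayBuffer(capacity=2)
for coeffs, score in history:
    buffer.add(coeffs, score)
```

Here `pairs` holds four (winner, loser) tuples, and `buffer.best()` returns `[[0.8, 0.2], [0.6, 0.4]]`. Each pair points from a weaker region of the coefficient space toward a stronger one, which is the signal a generator could be trained on.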

The whole pipeline runs without any back‑propagation through the massive LLM weights; only the lightweight generators are trained.
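The multi‑round idea, where the best merged model rejoins the pool as a new expert, can be illustrated with a gradient‑free toy. This sketch replaces the learned generators with plain random sampling on the simplex, and models are three‑element vectors scored against a hypothetical `target`; the function names and all values are illustrative assumptions.

```python
import random

def merge(experts, coeffs):
    """Weighted sum of expert vectors, position by position."""
    return [sum(w * v for w, v in zip(coeffs, col)) for col in zip(*experts)]

def score(model, target=(1.0, 2.0, 3.0)):
    """Toy fitness: negative squared distance to a hypothetical target."""
    return -sum((m - t) ** 2 for m, t in zip(model, target))

def sample_simplex(n, rng):
    """Normalised exponential draws give a uniform sample on the simplex."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def multi_round(experts, rounds=3, candidates=50, seed=0):
    rng = random.Random(seed)
    pool = [list(e) for e in experts]
    round_scores = []
    for _ in range(rounds):
        n = len(pool)
        # Always include each expert on its own, so a round can never
        # end up worse than the best model already in the pool.
        proposals = [one_hot(i, n) for i in range(n)]
        proposals += [sample_simplex(n, rng) for _ in range(candidates)]
        best = max((merge(pool, c) for c in proposals), key=score)
        pool.append(best)  # the merged model becomes a new expert
        round_scores.append(score(best))
    return pool, round_scores

experts = [[3.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 5.0]]
pool, round_scores = multi_round(experts)
```

No gradients are computed anywhere: every step is a forward evaluation of a candidate merge. Because each round can reselect the previous winner unchanged, `round_scores` is non‑decreasing.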

Results & Findings

Benchmark	Baseline (e.g., Simple Averaging)	EvoGM	Absolute Gain (percentage points)
SuperGLUE (zero‑shot)	78.2%	81.6%	+3.4
MMLU (few‑shot)	71.5%	74.9%	+3.4
Unseen domain (medical QA)	62.1%	66.8%	+4.7
Model size scaling (2‑way vs 4‑way merge)	0.9% drop	+0.5% improvement	n/a
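The gain column for the three benchmark rows equals the difference between the EvoGM and baseline scores, that is, a gain in percentage points. The short calculation below reproduces those figures from the table and also derives the relative improvement over the baseline, which is a different and larger number. The helper name `gains` is ours.

```python
# (baseline %, EvoGM %) taken from the table above.
results = {
    "SuperGLUE (zero-shot)": (78.2, 81.6),
    "MMLU (few-shot)": (71.5, 74.9),
    "Unseen domain (medical QA)": (62.1, 66.8),
}

def gains(baseline, evogm):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = evogm - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

summary = {name: gains(*scores) for name, scores in results.items()}
# summary["SuperGLUE (zero-shot)"]      -> (3.4, 4.3)
# summary["MMLU (few-shot)"]            -> (3.4, 4.8)
# summary["Unseen domain (medical QA)"] -> (4.7, 7.6)
```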

Robustness: EvoGM’s gains persist even when merging models with heterogeneous architectures (e.g., GPT‑Neo + LLaMA).
Sample efficiency: Matches the performance of exhaustive random search while using fewer than 10% of the evaluations.
Stability: The cycle‑consistent generators produce smoother coefficient distributions, reducing the variance of merged model performance across runs.

Practical Implications

Plug‑and‑play model ensembles – Teams can instantly create a “super‑model” from a collection of fine‑tuned LLMs (e.g., domain‑specific adapters) without costly retraining.
Resource‑constrained deployment – Merging can yield a single model that inherits strengths of several experts, cutting down memory and inference latency compared to running multiple models in parallel.
Rapid prototyping – Developers can experiment with different expert mixes (e.g., code‑generation + reasoning models) on the fly, using EvoGM as an automated optimizer.
Continuous improvement pipelines – As new fine‑tuned checkpoints become available, EvoGM can ingest them as new experts, automatically updating the merged model in production.

Limitations & Future Work

Coefficient linearity – The current formulation only supports linear weight interpolation; non‑linear blending (e.g., LoRA‑style adapters) remains unexplored.
Scalability to dozens of experts – While the method works well for up to ~8 models, the coefficient space gains one dimension per added expert, so larger pools may require hierarchical merging strategies.
Evaluation cost – Although far cheaper than full fine‑tuning, each candidate still needs a forward pass on a validation set, which can be expensive for very large LLMs.
Future directions suggested by the authors include extending the generative framework to learn structured merging operators (e.g., layer‑wise masks), integrating reinforcement‑learning‑style reward shaping for downstream metrics, and applying EvoGM to multimodal foundation models.

Authors

Tao Jiang
Xinmeng Yu
Chenhao Yi
Yiling Wu
Yan Li
Ran Cheng
Dongmei Jiang
Jianguo Zhang

Paper Information

arXiv ID: 2605.29295v1
Categories: cs.NE
Published: May 28, 2026