[Paper] EvoGM: Learning to Merge LLMs via Evolutionary Generative Optimization
Source: arXiv - 2605.29295v1
Overview
The paper introduces EvoGM, a new way to combine large language models (LLMs) without any additional training. By treating model merging as an evolutionary search problem and using a learnable generative model to propose merging coefficients, EvoGM finds high‑performing mixtures far more efficiently than prior hand‑crafted heuristics. The result is a practical, training‑free recipe for building stronger, task‑adapted LLMs on the fly.
Key Contributions
- Learnable coefficient generation – Replaces random mutation/crossover operators with a dual‑generator network that learns the distribution of promising merging weights.
- Cycle‑consistent training – Enforces that generated coefficients can be reconstructed from the merged model’s performance, improving sample quality.
- Winner‑loser pair mining – Leverages historical search trajectories to model the “elite” region of the coefficient space, boosting data efficiency.
- Multi‑round evolutionary pipeline – Iteratively treats the best merged models as new experts, enabling progressive refinement without any gradient‑based fine‑tuning.
- State‑of‑the‑art results – Demonstrates consistent gains over existing evolutionary merging baselines across a suite of zero‑shot and few‑shot benchmarks, including both seen and unseen tasks.
Methodology
Problem framing – Given a set of pre‑trained LLMs (the “experts”), the goal is to find scalar coefficients \(w_i\) that linearly combine their weights:
\[ \theta_{\text{merged}} = \sum_i w_i \, \theta_i \]
The search space is the simplex of coefficients: each \(w_i \ge 0\) and \(\sum_i w_i = 1\).
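As a minimal sketch of the merge above (not the authors' code), the snippet below treats each expert as a dictionary of NumPy arrays with identical keys and shapes. The helper names `project_to_simplex` and `merge_state_dicts` are hypothetical, and the softmax mapping onto the simplex is an assumption; the summary only states that the coefficients are non‑negative and sum to 1.

```python
import numpy as np

def project_to_simplex(raw_scores):
    """Map arbitrary real scores to positive weights that sum to 1 (softmax)."""
    raw_scores = np.asarray(raw_scores, dtype=np.float64)
    shifted = np.exp(raw_scores - raw_scores.max())  # subtract max for stability
    return shifted / shifted.sum()

def merge_state_dicts(experts, weights):
    """Return sum_i w_i * theta_i for experts with matching parameter names and shapes."""
    weights = np.asarray(weights, dtype=np.float64)
    if len(experts) != len(weights):
        raise ValueError("need exactly one coefficient per expert")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("coefficients must lie on the simplex")
    merged = {}
    for name in experts[0]:
        merged[name] = sum(w * expert[name] for w, expert in zip(weights, experts))
    return merged

# Toy "experts": a single 2x2 parameter tensor each (illustrative values only).
experts = [
    {"layer.weight": np.full((2, 2), 1.0)},
    {"layer.weight": np.full((2, 2), 3.0)},
]
w = project_to_simplex([0.0, 0.0])      # equal scores give equal weights
merged = merge_state_dicts(experts, w)  # every entry is the average of 1.0 and 3.0
```

With real checkpoints the same loop would run over framework tensors instead of NumPy arrays; the requirement that all experts share parameter names and shapes is an assumption of this sketch.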
Dual‑generator architecture
- Generator G₁ proposes candidate coefficient vectors from a latent noise vector.
- Generator G₂ acts as a reverse model, trying to reconstruct the latent code from the performance feedback of the merged model.
- A cycle‑consistency loss forces \(G_2(G_1(z)) \approx z\), ensuring that generated coefficients lie in a region that the evaluator can meaningfully assess.
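To make the cycle constraint concrete, here is a small NumPy sketch under strong simplifying assumptions: both generators are single linear maps with random weights, the loss is a mean squared error, and G₂ receives the coefficients directly rather than performance feedback. The names `g1`, `g2`, `cycle_loss` and the dimensions are hypothetical; the summary does not give the paper's actual architectures or loss.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, NUM_EXPERTS = 4, 3  # illustrative sizes, not from the paper

# Stand-in generator parameters (untrained random matrices).
W1 = rng.normal(size=(NUM_EXPERTS, LATENT_DIM))
W2 = rng.normal(size=(LATENT_DIM, NUM_EXPERTS))

def g1(z):
    """Forward generator: latent noise -> coefficients on the simplex."""
    logits = z @ W1.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def g2(w):
    """Reverse generator: coefficients -> reconstructed latent code."""
    return w @ W2.T

def cycle_loss(z):
    """Mean squared reconstruction error between G2(G1(z)) and z."""
    return float(np.mean((g2(g1(z)) - z) ** 2))

z = rng.normal(size=(8, LATENT_DIM))  # batch of 8 latent vectors
coeffs = g1(z)
loss = cycle_loss(z)
```

Training would minimize `loss` with respect to `W1` and `W2`; only these small generator parameters receive gradients, never the LLM weights.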
Evolutionary loop
- Population initialization: Random coefficients are sampled and evaluated on a validation set.
- Selection: The top‑k “winners” are paired with lower‑performing “losers” to form training pairs for the generators.
- Generation: G₁ samples new candidates; G₂ refines them using the cycle loss.
- Evaluation: Each new merged model is scored (e.g., accuracy, perplexity) and the elite set is fed back as the next generation’s experts.
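The four steps above can be sketched as a toy loop. This is an illustration only: the learned generator is replaced by a Gaussian fitted to the elite candidates' logits, and `evaluate` is a made‑up stand‑in for validation scoring that rewards closeness to a hidden target mixture. All names and hyperparameters are assumptions.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, mapping logits onto the coefficient simplex."""
    x = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(x)
    return exp / exp.sum(axis=-1, keepdims=True)

def evaluate(weights, target):
    """Toy stand-in for validation scoring: higher is better."""
    return -np.sum((weights - target) ** 2, axis=-1)

def evolve(target, pop_size=32, top_k=8, generations=20, seed=0):
    rng = np.random.default_rng(seed)
    num_experts = target.shape[0]
    # Population initialization: random logits mapped onto the simplex.
    logits = rng.normal(size=(pop_size, num_experts))
    scores = evaluate(softmax(logits), target)
    best_history = [scores.max()]
    for _ in range(generations):
        # Selection: keep the top-k candidates as the elite set.
        elite = logits[np.argsort(scores)[-top_k:]]
        # Generation: stand-in "generator" is a Gaussian fitted to the elites.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        children = rng.normal(mean, std, size=(pop_size - top_k, num_experts))
        # Evaluation: score the new population; elites carry over unchanged.
        logits = np.vstack([elite, children])
        scores = evaluate(softmax(logits), target)
        best_history.append(scores.max())
    best = softmax(logits[np.argmax(scores)])
    return best, best_history

target = np.array([0.6, 0.3, 0.1])  # hidden "ideal" mixture for the toy problem
best, history = evolve(target)
```

Because the elites are carried into each new population, the best score in `history` never decreases. In the real setting each call to `evaluate` would require merging the experts and running the merged model on a validation set.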
Data efficiency tricks
- Winner‑loser pairs indicate which direction in coefficient space improves performance, giving the generators a gradient‑like training signal without computing explicit gradients.
- Replay buffer stores past high‑quality coefficients to prevent forgetting.
The whole pipeline runs without any back‑propagation through the massive LLM weights; only the lightweight generators are trained.
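A hedged sketch of the two data‑efficiency ideas, using only the Python standard library. The pairing rule (each top‑k entry paired with a randomly chosen lower‑ranked one) and the fixed‑capacity, score‑ordered buffer are assumptions made for illustration; `mine_pairs` and `ReplayBuffer` are hypothetical names, and the scores are invented.

```python
import heapq
import itertools
import random

def mine_pairs(history, top_k, seed=0):
    """Pair each of the top_k entries with a randomly chosen lower-ranked entry."""
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda item: item[1], reverse=True)
    winners, losers = ranked[:top_k], ranked[top_k:]
    if not losers:
        return []
    return [(winner, rng.choice(losers)) for winner in winners]

class ReplayBuffer:
    """Keeps the `capacity` highest-scoring coefficient vectors seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                    # min-heap of (score, order, coeffs)
        self._counter = itertools.count()  # tie-breaker for equal scores

    def add(self, coeffs, score):
        entry = (score, next(self._counter), tuple(coeffs))
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the current worst

    def best(self):
        """Return stored (coeffs, score) entries, best first."""
        return [(list(c), s) for s, _, c in sorted(self._heap, reverse=True)]

# Made-up search history of (coefficients, validation score).
history = [
    ([0.5, 0.5], 0.70),
    ([0.8, 0.2], 0.82),
    ([0.2, 0.8], 0.64),
    ([0.6, 0.4], 0.79),
]
pairs = mine_pairs(history, top_k=2)

buffer = ReplayBuffer(capacity=2)
for coeffs, score in history:
    buffer.add(coeffs, score)
```

Each pair can then serve as one training example for the generators, while the buffer keeps earlier high‑scoring coefficients available in later rounds.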
Results & Findings
| Benchmark | Baseline (e.g., Simple Averaging) | EvoGM | Absolute gain (percentage points) |
|---|---|---|---|
| SuperGLUE (zero‑shot) | 78.2 % | 81.6 % | +3.4 |
| MMLU (few‑shot) | 71.5 % | 74.9 % | +3.4 |
| Unseen domain (medical QA) | 62.1 % | 66.8 % | +4.7 |
| Model size scaling (2‑way vs 4‑way merge) | 0.9 % drop | +0.5 % improvement | n/a |
- Robustness: EvoGM’s gains persist even when merging models with heterogeneous architectures (e.g., GPT‑Neo + LLaMA).
- Sample efficiency: Achieves comparable performance to exhaustive random search with < 10 % of the evaluations.
- Stability: The cycle‑consistent generators produce smoother coefficient distributions, reducing the variance of merged model performance across runs.
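The gain column of the table equals the difference between the EvoGM and baseline scores in percentage points; the improvement relative to the baseline is somewhat larger. The short calculation below makes the distinction explicit, using only the three accuracy rows reported above (the helper `gains` is a hypothetical name).

```python
rows = {
    "SuperGLUE (zero-shot)": (78.2, 81.6),
    "MMLU (few-shot)": (71.5, 74.9),
    "Unseen domain (medical QA)": (62.1, 66.8),
}

def gains(baseline, score):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = score - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

summary = {name: gains(b, s) for name, (b, s) in rows.items()}
for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute} points, +{relative}% relative")
```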
Practical Implications
- Plug‑and‑play model ensembles – Teams can instantly create a “super‑model” from a collection of fine‑tuned LLMs (e.g., domain‑specific adapters) without costly retraining.
- Resource‑constrained deployment – Merging can yield a single model that inherits strengths of several experts, cutting down memory and inference latency compared to running multiple models in parallel.
- Rapid prototyping – Developers can experiment with different expert mixes (e.g., code‑generation + reasoning models) on the fly, using EvoGM as an automated optimizer.
- Continuous improvement pipelines – As new fine‑tuned checkpoints become available, EvoGM can ingest them as new experts, automatically updating the merged model in production.
Limitations & Future Work
- Coefficient linearity – The current formulation only supports linear weight interpolation; non‑linear blending (e.g., LoRA‑style adapters) remains unexplored.
- Scalability to dozens of experts – While the method works well for up to ~8 models, the coefficient space grows with every added expert, so larger pools may require hierarchical merging strategies.
- Evaluation cost – Although far cheaper than full fine‑tuning, each candidate still has to be merged and run over a validation set, which can be expensive for very large LLMs.
- Future directions suggested by the authors include extending the generative framework to learn structured merging operators (e.g., layer‑wise masks), integrating reinforcement‑learning‑style reward shaping for downstream metrics, and applying EvoGM to multimodal foundation models.
Authors
- Tao Jiang
- Xinmeng Yu
- Chenhao Yi
- Yiling Wu
- Yan Li
- Ran Cheng
- Dongmei Jiang
- Jianguo Zhang
Paper Information
- arXiv ID: 2605.29295v1
- Categories: cs.NE
- Published: May 28, 2026