[Paper] Ensembling Language Models with Sequential Monte Carlo
Source: arXiv - 2603.05432v1
Overview
The paper presents a principled way to ensemble multiple language models during text generation. By treating the ensemble as a single probabilistic model and using a Sequential Monte Carlo (SMC) sampler that works at the byte level, the authors can combine models with different vocabularies and obtain unbiased samples—something that standard probability‑averaging tricks can’t guarantee.
Key Contributions
- Unified $f$-ensemble framework: Formalizes how to merge the next‑token distributions of $K$ models using any non‑negative aggregation function $f$ (e.g., arithmetic mean, geometric mean, max, learned weighting).
- Byte‑level Sequential Monte Carlo sampler: Introduces an SMC algorithm that operates on a shared character (byte) space, enabling:
- Consistent sampling from the true ensemble distribution in the limit.
- Compatibility across models with mismatched tokenizers/vocabularies.
- Empirical evaluation across tasks: Tests a variety of $f$-ensembles on structured generation benchmarks (code synthesis, SQL generation, data‑to‑text) and shows that many alternatives to simple probability averaging improve downstream performance.
- Analysis of posterior approximation quality: Demonstrates that ensembles that better approximate the true posterior over strings tend to yield higher accuracy, linking theoretical soundness to practical gains.
Methodology
**Define the ensemble distribution.**
For each position in a generated string, each model $M_i$ provides a probability vector $\mathbf{p}_i$ over its own token space. The paper maps all models to a common byte‑level space $\mathcal{B}$ (256 possible byte values) using each model's tokenizer. The ensemble's next‑byte distribution is then computed as

$$\mathbf{p}_{\text{ens}} = f(\mathbf{p}_1, \dots, \mathbf{p}_K),$$

where $f$ can be any monotone, non‑negative function (e.g., product, weighted sum).
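As one concrete reading of this aggregation step, the sketch below combines $K$ next‑byte distributions under a few of the $f$ choices named above. The function name, keyword options, and uniform default weighting are illustrative assumptions, not the paper's API.

```python
import numpy as np

def ensemble_next_byte(dists, f="geometric", weights=None):
    """Combine K next-byte distributions (each over the shared byte
    alphabet) with an aggregation function f, then renormalize.
    Hypothetical helper; names and options are illustrative."""
    P = np.asarray(dists)                 # shape (K, |alphabet|)
    K = P.shape[0]
    w = np.full(K, 1.0 / K) if weights is None else np.asarray(weights)
    if f == "arithmetic":                 # weighted probability averaging
        agg = np.einsum("k,kb->b", w, P)
    elif f == "geometric":                # weighted product of probabilities
        agg = np.exp(np.einsum("k,kb->b", w, np.log(P + 1e-12)))
    elif f == "max":                      # max-vote across models
        agg = P.max(axis=0)
    else:
        raise ValueError(f"unknown aggregation {f!r}")
    return agg / agg.sum()                # renormalize to a distribution
```

Note that the final renormalization is what makes a non‑negative $f$ yield a valid distribution regardless of whether $f$ itself preserves total mass.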
**Sequential Monte Carlo sampling.**
- Particles: Each particle represents a partially generated byte sequence.
- Proposal: At each step, propose the next byte for every particle using the ensemble distribution $\mathbf{p}_{\text{ens}}$.
- Weight update: Compute importance weights based on the true joint probability of the particle under the ensemble.
- Resampling: Periodically resample particles proportionally to their weights to focus computation on high‑probability continuations.
- As the number of particles $N \to \infty$, the particle set converges to samples from the exact ensemble distribution, regardless of tokenizer differences.
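The propose/weight/resample loop above can be sketched as follows. To stay self-contained, two fixed toy distributions stand in for the models, the geometric mean serves as the target and the arithmetic mean as the proposal; in the real sampler both would be conditioned on each particle's byte prefix. All names here are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two models' next-byte distributions over a
# 4-symbol alphabet (a real system would query K language models).
P1 = np.array([0.7, 0.1, 0.1, 0.1])
P2 = np.array([0.1, 0.7, 0.1, 0.1])

def smc(n_particles=256, seq_len=5, ess_threshold=0.5):
    """Minimal SMC sketch: propose from the arithmetic-mean ensemble,
    weight against the geometric-mean target, resample on low ESS."""
    target = np.sqrt(P1 * P2)
    target /= target.sum()                # renormalized geometric mean
    q = 0.5 * (P1 + P2)                   # proposal: arithmetic mean
    particles = np.zeros((n_particles, seq_len), dtype=int)
    logw = np.zeros(n_particles)
    for t in range(seq_len):
        draws = rng.choice(len(q), size=n_particles, p=q)
        particles[:, t] = draws
        # Importance weight: target over proposal for the new byte.
        logw += np.log(target[draws]) - np.log(q[draws])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resample when the effective sample size collapses.
        if 1.0 / np.sum(w ** 2) < ess_threshold * n_particles:
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = particles[idx]
            logw[:] = 0.0
    return particles, logw
```

Skipping the resampling branch is exactly the ablation discussed in the results: weights concentrate on a few particles and the rest are wasted computation.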
**Experimental setup.**
- Models: Mix of open‑source (e.g., LLaMA, GPT‑Neo) and proprietary (e.g., GPT‑3.5) models, each with its own tokenizer.
- Tasks: Structured generation benchmarks where correctness can be measured automatically (e.g., generating valid JSON, SQL queries, or Python code from a prompt).
- Baselines: Single‑model decoding, naïve probability averaging, and top‑k / nucleus sampling without ensembling.
Results & Findings
| Ensemble type | Task | Metric improvement vs. best single model |
|---|---|---|
| Arithmetic mean (probability averaging) | Code generation | +2.1 % exact match |
| Geometric mean | SQL generation | +3.4 % execution accuracy |
| Learned weighted sum (trained on a validation set) | Data‑to‑text | +4.0 % BLEU |
| Max‑vote (pick the token with the highest probability under any single model) | JSON generation | +1.8 % structural correctness |
- SMC vs. greedy decoding: Using 128 particles reduced the bias of the ensemble distribution, yielding up to 5 % higher task accuracy compared to greedy decoding with the same $f$.
- Tokenizer mismatch: Byte‑level SMC handled models with vocabularies ranging from 32 k to 50 k tokens without any preprocessing, confirming the practicality of the approach.
- Ablation: Removing the resampling step caused particle degeneracy and degraded performance, underscoring the importance of the full SMC pipeline.
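One ingredient behind the tokenizer‑mismatch result is marginalizing each model's token‑level distribution into the shared byte space. The helper below is a hypothetical, simplified sketch: it only counts single tokens that extend the current byte prefix, whereas a full implementation must also account for probability mass carried across token boundaries.

```python
from collections import defaultdict

def next_byte_dist(vocab_probs, prefix=b""):
    """Marginalize a token-level distribution (bytes -> prob) into a
    next-byte distribution, conditioned on a byte prefix. Simplified:
    ignores multi-token continuations of the prefix."""
    mass = defaultdict(float)
    total = 0.0
    for token, p in vocab_probs.items():
        if token.startswith(prefix) and len(token) > len(prefix):
            mass[token[len(prefix)]] += p   # next byte after the prefix
            total += p
    return {b: m / total for b, m in mass.items()} if total else {}

# Example with a tiny illustrative vocabulary:
vp = {b"the": 0.5, b"they": 0.3, b"a": 0.2}
next_byte_dist(vp)          # mass on b"t" (0.8) and b"a" (0.2)
next_byte_dist(vp, b"the")  # all remaining mass on b"y"
```

Because every model's tokens reduce to the same 256‑value byte alphabet, distributions produced this way can be aggregated directly, whatever the original vocabulary sizes.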
Practical Implications
- Robust production pipelines: Teams can safely combine heterogeneous LLMs (e.g., an internal fine‑tuned model + a hosted API) without having to align tokenizers, improving reliability for mission‑critical generation (e.g., automated report writing).
- Cost‑effective performance boosts: Instead of scaling a single model to billions of parameters, developers can ensemble several smaller, cheaper models and achieve comparable or better quality, especially for structured outputs.
- Customizable aggregation: The $f$-ensemble framework lets product teams experiment with domain‑specific weighting schemes (e.g., give higher weight to a model trained on legal text when generating contracts).
- Open‑source tooling: The byte‑level SMC algorithm can be wrapped as a drop‑in decoder for existing inference libraries (e.g., Hugging Face Transformers), making it straightforward to integrate into existing pipelines.
Limitations & Future Work
- Computational overhead: SMC requires maintaining multiple particles and periodic resampling, which can increase latency compared to single‑model greedy decoding. Optimizing particle count vs. quality is an open engineering challenge.
- Scalability to very large ensembles: The paper experiments with up to 4 models; the memory and compute cost of handling many tokenizers simultaneously may become prohibitive.
- Learning the aggregation function: While a simple weighted sum was explored, more sophisticated, possibly context‑dependent $f$ functions (e.g., neural gating networks) were left for future research.
- Evaluation breadth: The study focuses on structured generation tasks; applying the method to open‑ended chat or creative writing scenarios could reveal different trade‑offs.
Bottom line: By marrying a solid probabilistic formulation with a practical byte‑level SMC sampler, this work shows that *ensembling language models is no longer a theoretical curiosity: it can be deployed today to make LLM‑driven products more accurate, robust, and flexible.*
Authors
- Robin Shing Moon Chan
- Tianyu Liu
- Samuel Kiegeland
- Clemente Pasti
- Jacob Hoover Vigly
- Timothy J. O’Donnell
- Ryan Cotterell
- Tim Vieira
Paper Information
- arXiv ID: 2603.05432v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: March 5, 2026