[Paper] Ensembling Language Models with Sequential Monte Carlo
Source: arXiv - 2603.05432v1
Overview
The paper presents a principled way to ensemble multiple language models during text generation. By treating the ensemble as a single probabilistic model and using a Sequential Monte Carlo (SMC) sampler that works at the byte level, the authors can combine models with different vocabularies and obtain unbiased samples—something that standard probability‑averaging tricks can’t guarantee.
Key Contributions
- Unified $f$-ensemble framework: Formalizes how to merge the next‑token distributions of $K$ models using any non‑negative aggregation function $f$ (e.g., arithmetic mean, geometric mean, max, learned weighting).
- Byte‑level Sequential Monte Carlo sampler: Introduces an SMC algorithm that operates on a shared character (byte) space, enabling:
- Consistent sampling from the true ensemble distribution in the limit.
- Compatibility across models with mismatched tokenizers/vocabularies.
- Empirical evaluation across tasks: Tests a variety of $f$-ensembles on structured generation benchmarks (code synthesis, SQL generation, data‑to‑text) and shows that many alternatives to simple probability averaging improve downstream performance.
- Analysis of posterior approximation quality: Demonstrates that ensembles that better approximate the true posterior over strings tend to yield higher accuracy, linking theoretical soundness to practical gains.
Methodology
**Define the ensemble distribution.**
For each position in a generated string, each model $M_i$ provides a probability vector $\mathbf{p}_i$ over its own token space. The paper maps all models to a common byte‑level space $\mathcal{B}$ (256 possible byte values) using each model's tokenizer. The ensemble's next‑byte distribution is then computed as

$$\mathbf{p}_{\text{ens}} = f(\mathbf{p}_1, \dots, \mathbf{p}_K),$$

where $f$ can be any monotone, non‑negative function (e.g., product, weighted sum).
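As one concrete reading of this aggregation step, the sketch below combines $K$ next‑byte distributions under a few of the $f$ choices named above. The function name, keyword options, and uniform default weighting are illustrative assumptions, not the paper's API.

```python
import numpy as np

def ensemble_next_byte(dists, f="geometric", weights=None):
    """Combine K next-byte distributions (each over the shared byte
    alphabet) with an aggregation function f, then renormalize.
    Hypothetical helper; names and options are illustrative."""
    P = np.asarray(dists)                 # shape (K, |alphabet|)
    K = P.shape[0]
    w = np.full(K, 1.0 / K) if weights is None else np.asarray(weights)
    if f == "arithmetic":                 # weighted probability averaging
        agg = np.einsum("k,kb->b", w, P)
    elif f == "geometric":                # weighted product of probabilities
        agg = np.exp(np.einsum("k,kb->b", w, np.log(P + 1e-12)))
    elif f == "max":                      # max-vote across models
        agg = P.max(axis=0)
    else:
        raise ValueError(f"unknown aggregation {f!r}")
    return agg / agg.sum()                # renormalize to a distribution
```

Note that the final renormalization is what makes a non‑negative $f$ yield a valid distribution regardless of whether $f$ itself preserves total mass.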
**Sequential Monte Carlo sampling.**
- Particles: Each particle represents a partially generated byte sequence.
- Proposal: At each step, propose the next byte for every particle using the ensemble distribution $\mathbf{p}_{\text{ens}}$.
- Weight update: Compute importance weights based on the true joint probability of the particle under the ensemble.
- Resampling: Periodically resample particles proportionally to their weights to focus computation on high‑probability continuations.
- As the number of particles $N \to \infty$, the particle set converges to samples from the exact ensemble distribution, regardless of tokenizer differences.
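The propose/weight/resample loop above can be sketched as follows. To stay self-contained, two fixed toy distributions stand in for the models, the geometric mean serves as the target and the arithmetic mean as the proposal; in the real sampler both would be conditioned on each particle's byte prefix. All names here are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two models' next-byte distributions over a
# 4-symbol alphabet (a real system would query K language models).
P1 = np.array([0.7, 0.1, 0.1, 0.1])
P2 = np.array([0.1, 0.7, 0.1, 0.1])

def smc(n_particles=256, seq_len=5, ess_threshold=0.5):
    """Minimal SMC sketch: propose from the arithmetic-mean ensemble,
    weight against the geometric-mean target, resample on low ESS."""
    target = np.sqrt(P1 * P2)
    target /= target.sum()                # renormalized geometric mean
    q = 0.5 * (P1 + P2)                   # proposal: arithmetic mean
    particles = np.zeros((n_particles, seq_len), dtype=int)
    logw = np.zeros(n_particles)
    for t in range(seq_len):
        draws = rng.choice(len(q), size=n_particles, p=q)
        particles[:, t] = draws
        # Importance weight: target over proposal for the new byte.
        logw += np.log(target[draws]) - np.log(q[draws])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resample when the effective sample size collapses.
        if 1.0 / np.sum(w ** 2) < ess_threshold * n_particles:
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = particles[idx]
            logw[:] = 0.0
    return particles, logw
```

Skipping the resampling branch is exactly the ablation discussed in the results: weights concentrate on a few particles and the rest are wasted computation.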
**Experimental setup.**
- Models: Mix of open‑source (e.g., LLaMA, GPT‑Neo) and proprietary (e.g., GPT‑3.5) models, each with its own tokenizer.
- Tasks: Structured generation benchmarks where correctness can be measured automatically (e.g., generating valid JSON, SQL queries, or Python code from a prompt).
- Baselines: Single‑model decoding, naïve probability averaging, and top‑k / nucleus sampling without ensembling.
Results & Findings
| Ensemble type | Task | Metric improvement vs. best single model |
|---|---|---|
| Arithmetic mean (probability averaging) | Code generation | +2.1 % exact match |
| Geometric mean | SQL generation | +3.4 % execution accuracy |
| Learned weighted sum (trained on a validation set) | Data‑to‑text | +4.0 % BLEU |
| Max‑vote (pick the token with the highest probability under any single model) | JSON generation | +1.8 % structural correctness |
- SMC vs. greedy decoding: Using 128 particles reduced the bias of the ensemble distribution, yielding up to 5 % higher task accuracy compared to greedy decoding with the same $f$.
- Tokenizer mismatch: Byte‑level SMC handled models with vocabularies ranging from 32 k to 50 k tokens without any preprocessing, confirming the practicality of the approach.
- Ablation: Removing the resampling step caused particle degeneracy and degraded performance, underscoring the importance of the full SMC pipeline.
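One ingredient behind the tokenizer‑mismatch result is marginalizing each model's token‑level distribution into the shared byte space. The helper below is a hypothetical, simplified sketch: it only counts single tokens that extend the current byte prefix, whereas a full implementation must also account for probability mass carried across token boundaries.

```python
from collections import defaultdict

def next_byte_dist(vocab_probs, prefix=b""):
    """Marginalize a token-level distribution (bytes -> prob) into a
    next-byte distribution, conditioned on a byte prefix. Simplified:
    ignores multi-token continuations of the prefix."""
    mass = defaultdict(float)
    total = 0.0
    for token, p in vocab_probs.items():
        if token.startswith(prefix) and len(token) > len(prefix):
            mass[token[len(prefix)]] += p   # next byte after the prefix
            total += p
    return {b: m / total for b, m in mass.items()} if total else {}

# Example with a tiny illustrative vocabulary:
vp = {b"the": 0.5, b"they": 0.3, b"a": 0.2}
next_byte_dist(vp)          # mass on b"t" (0.8) and b"a" (0.2)
next_byte_dist(vp, b"the")  # all remaining mass on b"y"
```

Because every model's tokens reduce to the same 256‑value byte alphabet, distributions produced this way can be aggregated directly, whatever the original vocabulary sizes.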
Practical Implications
- Robust production pipelines: Teams can safely combine heterogeneous LLMs (e.g., an internal fine‑tuned model + a hosted API) without having to align tokenizers, improving reliability for mission‑critical generation (e.g., automated report writing).
- Cost‑effective performance boosts: Instead of scaling a single model to billions of parameters, developers can ensemble several smaller, cheaper models and achieve comparable or better quality, especially for structured outputs.
- Customizable aggregation: The $f$-ensemble framework lets product teams experiment with domain‑specific weighting schemes (e.g., give higher weight to a model trained on legal text when generating contracts).
- Open‑source tooling: The byte‑level SMC algorithm can be wrapped as a drop‑in decoder for existing inference libraries (e.g., Hugging Face Transformers), making it straightforward to integrate into existing pipelines.
Limitations & Future Work
- Computational overhead: SMC requires maintaining multiple particles and periodic resampling, which can increase latency compared to single‑model greedy decoding. Optimizing particle count vs. quality is an open engineering challenge.
- Scalability to very large ensembles: The paper experiments with up to 4 models; the memory and compute cost of handling many tokenizers simultaneously may become prohibitive.
- Learning the aggregation function: While a simple weighted sum was explored, more sophisticated, possibly context‑dependent $f$ functions (e.g., neural gating networks) were left for future research.
- Evaluation breadth: The study focuses on structured generation tasks; applying the method to open‑ended chat or creative writing scenarios could reveal different trade‑offs.
Bottom line: By marrying a solid probabilistic formulation with a practical byte‑level SMC sampler, this work shows that *ensembling language models is no longer a theoretical curiosity: it can be deployed today to make LLM‑driven products more accurate, robust, and flexible.*
Authors
- Robin Shing Moon Chan
- Tianyu Liu
- Samuel Kiegeland
- Clemente Pasti
- Jacob Hoover Vigly
- Timothy J. O’Donnell
- Ryan Cotterell
- Tim Vieira
Paper Information
- arXiv ID: 2603.05432v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: March 5, 2026