[Paper] Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation
Source: arXiv - 2512.17820v1
Overview
Sequential recommendation (SR) systems power the "next‑item" suggestions you see on e‑commerce sites, streaming platforms, and news feeds. A new paper by Collins et al. investigates a surprisingly simple question: do item IDs and textual descriptions really need fancy fusion tricks, or can we let each do its own thing and combine them later? Their answer: yes, the two sources are complementary, and a lightweight ensemble of separately trained models beats many state‑of‑the‑art approaches.
Key Contributions
- Empirical proof of complementarity – Demonstrates that ID‑based and text‑based SR models capture distinct signals, so combining them improves on either alone.
- Simple training pipeline – Trains an ID‑only model and a text‑only model independently (no joint loss, no multi‑stage pre‑training).
- Ensembling strategy – Merges the two models at inference time using a straightforward weighted average of their scores.
- Strong empirical results – The ensemble consistently outperforms several strong baselines (e.g., SASRec, BERT4Rec, and recent multimodal SR models) across multiple public datasets.
- Practical insight – Shows that complex multimodal fusion architectures are not a prerequisite for state‑of‑the‑art performance.
Methodology
- Dataset preparation – Standard sequential recommendation benchmarks (e.g., Amazon, MovieLens) are enriched with item textual metadata (titles, descriptions).
- Model families (both branches are illustrated in the first sketch after this list)
  - ID‑only model: A conventional transformer‑based SR architecture (e.g., SASRec) that learns embeddings purely from item IDs.
  - Text‑only model: The same architecture, but the input embeddings are derived from a pretrained language model (e.g., BERT) applied to the item text.
- Independent training – Each model is trained separately on the same user interaction sequences, using the usual next‑item prediction loss (cross‑entropy). No shared parameters or alignment losses are introduced.
- Ensembling – At inference, each model produces a score vector over candidate items, and the final recommendation score is a convex combination (see the second sketch after this list):

  \[ \text{Score}_{\text{final}} = \alpha \cdot \text{Score}_{\text{ID}} + (1-\alpha) \cdot \text{Score}_{\text{text}} \]

  The weight α is tuned on a validation split (often around 0.5–0.7, indicating a slight bias toward ID signals).
- Evaluation – Standard ranking metrics (Hit@K, NDCG@K) are reported, comparing the ensemble against single‑modal baselines and more sophisticated multimodal SR methods.
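To make the two model families concrete, here is a minimal PyTorch sketch, assuming precomputed text embeddings (`text_embs`) from a frozen language model. `SRTransformer` and all names are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of the two model families (not the authors' code): the same
# transformer backbone consumes either learned ID embeddings or frozen
# text-derived embeddings. All names here are illustrative.
import torch
import torch.nn as nn

class SRTransformer(nn.Module):
    """SASRec-style next-item model; only the item-embedding source differs."""

    def __init__(self, num_items, dim=64, text_embs=None):
        super().__init__()
        self.num_items = num_items
        if text_embs is None:
            # ID-only branch: an embedding per item ID, learned from scratch.
            self.item_emb = nn.Embedding(num_items + 1, dim, padding_idx=0)
        else:
            # Text-only branch: rows precomputed by a pretrained language model
            # (e.g., pooled BERT outputs), kept frozen and projected to `dim`,
            # so only the projection and the transformer are trained.
            frozen = nn.Embedding.from_pretrained(text_embs, freeze=True, padding_idx=0)
            self.item_emb = nn.Sequential(frozen, nn.Linear(text_embs.size(1), dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True),
            num_layers=2,
        )  # causal masking and positional encodings omitted for brevity

    def forward(self, seq):                    # seq: (batch, seq_len) item IDs
        h = self.encoder(self.item_emb(seq))   # (batch, seq_len, dim)
        return h[:, -1, :]                     # last position = user state

    def scores(self, seq):
        """Dot-product scores over the full catalog: (batch, num_items + 1)."""
        table = self.item_emb(torch.arange(self.num_items + 1))
        return self.forward(seq) @ table.T
```

Both branches would then be trained independently with the usual cross‑entropy over next‑item logits, e.g. `F.cross_entropy(model.scores(seq), next_item)`.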
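The inference‑time ensemble and the ranking metrics are equally compact. The second sketch below assumes score tensors of shape (batch, num_items) and one ground‑truth next item per user; the α grid simply covers [0, 1], consistent with the reported 0.5–0.7 optimum.

```python
# Sketch of the inference-time ensemble and metrics (assumed shapes/names):
# `id_scores` and `text_scores` are (batch, num_items) tensors, `targets`
# holds one ground-truth next item per user.
import torch

def ensemble_scores(id_scores, text_scores, alpha):
    """Convex combination of the two models' score vectors."""
    return alpha * id_scores + (1.0 - alpha) * text_scores

def hit_and_ndcg_at_k(scores, targets, k=10):
    """Hit@K and NDCG@K; with a single relevant item, DCG@K equals NDCG@K."""
    topk = scores.topk(k, dim=-1).indices            # (batch, k)
    hits = topk == targets.unsqueeze(-1)             # (batch, k) bool
    hit_mask = hits.any(-1)
    rank = hits.float().argmax(-1)                   # 0-based rank of the hit
    dcg = torch.where(hit_mask, 1.0 / torch.log2(rank.float() + 2.0),
                      torch.zeros_like(rank, dtype=torch.float))
    return hit_mask.float().mean().item(), dcg.mean().item()

def tune_alpha(id_scores, text_scores, targets, grid=None):
    """Pick alpha on the validation split by NDCG@10."""
    grid = grid or [i / 10 for i in range(11)]
    return max(grid, key=lambda a: hit_and_ndcg_at_k(
        ensemble_scores(id_scores, text_scores, a), targets)[1])
```

The chosen α is then frozen and reused at test time; no gradients or joint training are involved at this stage.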
Results & Findings
| Model | Hit@10 | NDCG@10 |
|---|---|---|
| ID‑only (SASRec) | 0.312 | 0.184 |
| Text‑only (BERT‑SR) | 0.298 | 0.176 |
| Complex multimodal (e.g., MMRec) | 0.327 | 0.191 |
| Ensemble (ID + Text) | 0.352 | 0.213 |
- The ensemble outperforms every baseline, improving Hit@10 and NDCG@10 by roughly 2–5 absolute points.
- Ablation studies reveal that the performance gain persists across different α values, confirming that both modalities contribute meaningfully.
- Inference cost scales linearly with the number of ensembled models (see the sketch below); adding a third modality (e.g., images) yields diminishing returns unless the new signal is truly orthogonal.
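The scaling remark suggests an obvious generalization, sketched below as our reading rather than code from the paper: a weighted average over any number of per‑modality score tensors.

```python
# Our reading of the scaling remark (not code from the paper): the two-model
# convex combination generalizes to n per-modality score tensors.
def ensemble_n(score_list, weights):
    assert len(score_list) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-6, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, score_list))
```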
Practical Implications
- Faster development cycles – Teams can reuse existing ID‑based SR pipelines and plug in a pretrained text encoder without redesigning the whole architecture.
- Modular deployment – Because the two models are independent, they can be served on separate hardware (e.g., ID model on CPU, text model on GPU) and combined at the API layer (sketched after this list), offering flexibility for latency‑critical services.
- Robustness to cold‑start – Text embeddings shine for new items lacking interaction history, while ID embeddings dominate for well‑known items. The ensemble automatically balances the two, reducing the need for explicit cold‑start heuristics.
- Cost‑effective experimentation – Researchers and product engineers can test new language models (e.g., LLaMA, RoBERTa) by simply swapping the text encoder, keeping the rest of the stack unchanged.
- Simplified maintenance – No joint training or alignment losses means fewer hyper‑parameters to tune and less risk of training instability, which is attractive for production teams.
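As a hypothetical illustration of that API‑layer combination: the two services, their endpoints, and the response format below are invented for the sketch; only the convex blending reflects the paper's method.

```python
# Hypothetical API-layer merge: the services, endpoints, and item_id -> score
# response format are invented for illustration; only the blending is the
# paper's technique.
import requests

ID_SERVICE = "http://id-model.internal/score"      # assumed CPU-served endpoint
TEXT_SERVICE = "http://text-model.internal/score"  # assumed GPU-served endpoint

def recommend(user_id, alpha=0.6, k=10):
    id_scores = requests.get(ID_SERVICE, params={"user": user_id}).json()
    text_scores = requests.get(TEXT_SERVICE, params={"user": user_id}).json()
    # Blend per-item scores; missing items default to 0 in the text response.
    blended = {item: alpha * s + (1 - alpha) * text_scores.get(item, 0.0)
               for item, s in id_scores.items()}
    return sorted(blended, key=blended.get, reverse=True)[:k]
```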
Limitations & Future Work
- Dependence on text quality – Items with sparse or noisy descriptions (common in some e‑commerce categories) blunt the benefit of the text branch.
- Static weighting – The ensemble uses a single global α; a dynamic, context‑aware weighting (e.g., based on item popularity) could further improve results. A speculative sketch of this idea follows the list.
- Scalability to massive catalogs – While inference is cheap, maintaining two large models may increase memory footprints; model compression techniques were not explored.
- Beyond text – The authors hint at extending the framework to visual or audio modalities, but the current study focuses solely on ID and text.
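The dynamic‑weighting direction could look something like the following, which is our extrapolation rather than anything evaluated in the paper: interpolate α per item from its interaction count, so cold items lean on the text model and popular items on the ID model.

```python
# Speculative sketch of the dynamic-weighting idea (not evaluated in the
# paper): per-item alpha rises with interaction history.
import torch

def popularity_alpha(interaction_counts, saturation=50.0):
    """Per-item alpha in [0, 1): approaches 1 as an item accumulates history."""
    return interaction_counts / (interaction_counts + saturation)

def dynamic_ensemble(id_scores, text_scores, interaction_counts):
    alpha = popularity_alpha(interaction_counts.float())    # (num_items,)
    return alpha * id_scores + (1.0 - alpha) * text_scores  # broadcasts over batch
```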
Overall, the paper delivers a compelling, engineer‑friendly recipe: train simple ID and text models separately, then ensemble them. It challenges the prevailing belief that sophisticated multimodal fusion is mandatory for top‑tier sequential recommendation performance.
Authors
- Liam Collins
- Bhuvesh Kumar
- Clark Mingxuan Ju
- Tong Zhao
- Donald Loveland
- Leonardo Neves
- Neil Shah
Paper Information
- arXiv ID: 2512.17820v1
- Categories: cs.LG
- Published: December 19, 2025