[Paper] LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification
Source: arXiv - 2512.10793v1
Overview
The paper introduces LabelFusion, a plug‑and‑play ensemble that learns to blend a conventional transformer classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs) such as GPT‑4, Gemini, or DeepSeek. By fusing the two signal streams, the system delivers higher‑quality text‑classification predictions while letting users balance accuracy, latency, and API cost, making it attractive for production‑grade NLP pipelines.
Key Contributions
- AutoFusionClassifier API – a high‑level, zero‑config entry point that trains the whole fusion pipeline end‑to‑end (a usage sketch follows this list).
- Hybrid representation – concatenates the transformer’s contextual embeddings with LLM‑generated per‑class scores (obtained via structured prompts).
- FusionMLP – a lightweight multi‑layer perceptron that learns the optimal weighting of the two sources, rather than relying on hand‑crafted heuristics.
- Cost‑aware inference – the framework can dynamically switch between “high‑accuracy” (LLM‑involved) and “low‑latency/low‑cost” (transformer‑only) modes.
- Strong empirical results – achieves 92.4 % accuracy on AG News and 92.3 % on a 10‑class Reuters‑21578 split, outperforming both the standalone transformer and the LLM baselines.
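The snippet below sketches how this zero‑config entry point might be used, including the cost‑aware mode switch. The `AutoFusionClassifier` name comes from the paper; the package path, constructor arguments, and the `mode` parameter are illustrative assumptions, not the library's confirmed API.

```python
# Hypothetical usage sketch of the AutoFusionClassifier entry point.
# The class name comes from the paper; the module path, constructor arguments,
# and the "mode" switch are illustrative assumptions.
from labelfusion import AutoFusionClassifier  # assumed package/module name

train_texts = ["Stocks rallied after the earnings report.", "The striker scored twice."]
train_labels = ["Business", "Sports"]

clf = AutoFusionClassifier(
    transformer="roberta-base",      # backbone transformer
    llm="gpt-4",                     # LLM used for per-class scoring
    classes=["World", "Sports", "Business", "Sci/Tech"],
)
clf.fit(train_texts, train_labels)   # trains backbone + FusionMLP end-to-end

# Cost-aware inference: toggle between LLM-involved and transformer-only modes.
pred_accurate = clf.predict(["Apple unveiled a new chip."], mode="high_accuracy")
pred_fast = clf.predict(["Apple unveiled a new chip."], mode="fast")  # no LLM call
```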
Methodology
- Backbone Transformer – A standard fine‑tuned transformer (e.g., RoBERTa‑base) processes the input text and outputs a pooled embedding vector.
- LLM Prompting – For each target class, a concise prompt (e.g., “Is the article about Sports? Answer Yes/No”) is sent to the chosen LLM. The LLM’s textual response is parsed into a confidence score per class.
- Feature Fusion – The transformer embedding (≈768‑dim) is concatenated with the vector of LLM scores (one entry per class).
- FusionMLP – A shallow MLP (typically 2–3 layers, ReLU activations) consumes the fused vector and outputs the final class probabilities. The entire pipeline (transformer, prompting logic treated as a differentiable proxy during training, and FusionMLP) is optimized jointly with cross‑entropy loss; a minimal sketch of this fusion step follows the list.
- Training & Inference Modes – During training, LLM scores are simulated with a “teacher‑model” that mimics the LLM’s behavior, keeping the process fully differentiable. At inference time, real LLM calls can be toggled on/off per request, enabling the cost‑aware trade‑off.
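The following PyTorch sketch illustrates the fusion step described above: a pooled transformer embedding is concatenated with per‑class LLM scores and passed through a shallow FusionMLP. Layer sizes, the Yes/No parsing, and the dummy inputs are illustrative assumptions; only the overall structure follows the paper.

```python
# Minimal sketch of the feature fusion + FusionMLP step. Shapes, hidden sizes,
# and the Yes/No parsing are assumptions; the paper specifies the overall
# design (pooled embedding + per-class LLM scores -> shallow MLP), not these values.
import torch
import torch.nn as nn

NUM_CLASSES = 4          # e.g., AG News
EMBED_DIM = 768          # RoBERTa-base pooled embedding size

def parse_llm_scores(yes_no_answers):
    """Map per-class Yes/No answers from the structured prompts to a score vector."""
    return torch.tensor([1.0 if a.strip().lower().startswith("yes") else 0.0
                         for a in yes_no_answers])

class FusionMLP(nn.Module):
    """Shallow MLP that learns to weight transformer and LLM signals."""
    def __init__(self, embed_dim=EMBED_DIM, num_classes=NUM_CLASSES, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, pooled_embedding, llm_scores):
        fused = torch.cat([pooled_embedding, llm_scores], dim=-1)  # feature fusion
        return self.net(fused)                                     # class logits

# Example forward pass with dummy inputs standing in for the real backbone and LLM.
pooled = torch.randn(1, EMBED_DIM)                            # transformer output
scores = parse_llm_scores(["No", "Yes", "No", "No"]).unsqueeze(0)
logits = FusionMLP()(pooled, scores)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))       # joint training signal
```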
Results & Findings
| Dataset | Baseline RoBERTa | Baseline LLM (zero‑shot) | LabelFusion (full) |
|---|---|---|---|
| AG News (4‑class) | 90.1 % | 88.5 % | 92.4 % |
| Reuters‑21578 (10‑class) | 90.7 % | 89.2 % | 92.3 % |
- Robustness: LabelFusion maintains performance when individual components degrade (e.g., when the LLM is throttled or the transformer is under‑trained).
- Latency/Cost trade‑off: In “fast” mode (transformer only), accuracy drops by only ~1 % while latency roughly halves and API cost drops to zero.
- Ablation: Removing the LLM scores reduces accuracy by ~1.8 %; removing the transformer embeddings reduces it by ~2.2 %, confirming complementary strengths.
Practical Implications
- Plug‑and‑play for production – Developers can replace a single‑model classifier with `AutoFusionClassifier` and immediately gain a measurable boost without redesigning data pipelines.
- Dynamic cost control – SaaS platforms can expose a “budget” knob that decides whether to invoke the LLM for each request, enabling per‑request cost optimization (a routing sketch follows this list).
- Multi‑label extensions – The same fusion logic works for multi‑label tasks (e.g., tagging news articles with multiple topics), making it suitable for recommendation engines and content moderation.
- Domain adaptation – Because the LLM brings world knowledge, the fused model adapts faster to emerging vocabularies (e.g., new tech terms) without extensive re‑training of the transformer.
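A minimal sketch of such a budget knob is shown below, assuming the hypothetical `mode` argument from the earlier usage example and an illustrative per‑call cost; this routing helper is not an API defined in the paper.

```python
# Hypothetical per-request budget knob illustrating the "dynamic cost control"
# idea above. The threshold, per-call cost, and clf.predict(..., mode=...)
# interface are assumptions for illustration.
def classify_with_budget(clf, text, remaining_budget_usd, llm_call_cost_usd=0.002):
    """Invoke the LLM-fused path only while the request budget allows it."""
    if remaining_budget_usd >= llm_call_cost_usd:
        label = clf.predict([text], mode="high_accuracy")[0]   # transformer + LLM
        remaining_budget_usd -= llm_call_cost_usd
    else:
        label = clf.predict([text], mode="fast")[0]            # transformer only
    return label, remaining_budget_usd
```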
Limitations & Future Work
- Prompt engineering overhead – Crafting high‑quality per‑class prompts still requires manual effort; automated prompt generation is an open research direction.
- LLM latency variability – Real‑time LLM calls can be unpredictable, especially under heavy load; the paper suggests caching strategies but does not evaluate them extensively (a simple caching sketch follows this list).
- Scalability to hundreds of classes – Concatenating a score per class may become memory‑intensive for very large label spaces; future work could explore hierarchical or sparse fusion mechanisms.
- Differentiable LLM proxy – The training proxy approximates LLM behavior; mismatches between proxy and actual LLM responses could affect final performance, a gap the authors plan to close with reinforcement‑learning fine‑tuning.
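As one illustration of the caching idea, the sketch below memoizes per‑class prompt responses so repeated requests skip the API call. The `call_llm` callable and the cache‑key scheme are assumptions for illustration, not a design evaluated in the paper.

```python
# Sketch of response caching for per-class LLM prompts: identical
# (text, class) pairs reuse the stored Yes/No answer instead of a new API call.
import hashlib

_response_cache: dict[str, str] = {}

def _cache_key(text: str, class_name: str) -> str:
    return hashlib.sha256(f"{class_name}::{text}".encode("utf-8")).hexdigest()

def cached_llm_score(text: str, class_name: str, call_llm) -> str:
    """Return a cached Yes/No answer if available, otherwise query the LLM."""
    key = _cache_key(text, class_name)
    if key not in _response_cache:
        prompt = f"Is the article about {class_name}? Answer Yes/No.\n\n{text}"
        _response_cache[key] = call_llm(prompt)
    return _response_cache[key]
```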
Authors
- Michael Schlee
- Christoph Weisser
- Timo Kivimäki
- Melchizedek Mashiku
- Benjamin Saefken
Paper Information
- arXiv ID: 2512.10793v1
- Categories: cs.CL, cs.AI
- Published: December 11, 2025