[Paper] LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification
Source: arXiv - 2512.10793v1
Overview
The paper introduces LabelFusion, a plug‑and‑play ensemble that learns to blend a conventional transformer classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs) such as GPT‑4, Gemini, or DeepSeek. By fusing the two signal streams, the system delivers higher‑quality text‑classification predictions while letting users balance accuracy, latency, and API cost, making it attractive for production‑grade NLP pipelines.
Key Contributions
- AutoFusionClassifier API – a high‑level, zero‑config entry point that trains the whole fusion pipeline end‑to‑end (a usage sketch follows this list).
- Hybrid representation – concatenates the transformer’s contextual embeddings with LLM‑generated per‑class scores (obtained via structured prompts).
- FusionMLP – a lightweight multi‑layer perceptron that learns the optimal weighting of the two sources, rather than relying on hand‑crafted heuristics.
- Cost‑aware inference – the framework can dynamically switch between “high‑accuracy” (LLM‑involved) and “low‑latency/low‑cost” (transformer‑only) modes.
- Strong empirical results – achieves 92.4 % accuracy on AG News and 92.3 % on a 10‑class Reuters‑21578 split, outperforming both the standalone transformer and the LLM baselines.
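The snippet below sketches how this zero‑config entry point might be used, including the cost‑aware mode switch. The `AutoFusionClassifier` name comes from the paper; the package path, constructor arguments, and the `mode` parameter are illustrative assumptions, not the library's confirmed API.

```python
# Hypothetical usage sketch of the AutoFusionClassifier entry point.
# The class name comes from the paper; the module path, constructor arguments,
# and the "mode" switch are illustrative assumptions.
from labelfusion import AutoFusionClassifier  # assumed package/module name

train_texts = ["Stocks rallied after the earnings report.", "The striker scored twice."]
train_labels = ["Business", "Sports"]

clf = AutoFusionClassifier(
    transformer="roberta-base",      # backbone transformer
    llm="gpt-4",                     # LLM used for per-class scoring
    classes=["World", "Sports", "Business", "Sci/Tech"],
)
clf.fit(train_texts, train_labels)   # trains backbone + FusionMLP end-to-end

# Cost-aware inference: toggle between LLM-involved and transformer-only modes.
pred_accurate = clf.predict(["Apple unveiled a new chip."], mode="high_accuracy")
pred_fast = clf.predict(["Apple unveiled a new chip."], mode="fast")  # no LLM call
```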
Methodology
- Backbone Transformer – A standard fine‑tuned transformer (e.g., RoBERTa‑base) processes the input text and outputs a pooled embedding vector.
- LLM Prompting – For each target class, a concise prompt (e.g., “Is the article about Sports? Answer Yes/No”) is sent to the chosen LLM. The LLM’s textual response is parsed into a confidence score per class.
- Feature Fusion – The transformer embedding (≈768‑dim) is concatenated with the vector of LLM scores (one entry per class).
- FusionMLP – A shallow MLP (typically 2–3 layers, ReLU activations) consumes the fused vector and outputs the final class probabilities. The entire pipeline (transformer, prompting logic treated as a differentiable proxy during training, and FusionMLP) is optimized jointly with cross‑entropy loss; a minimal sketch of this fusion step follows the list.
- Training & Inference Modes – During training, LLM scores are simulated with a “teacher‑model” that mimics the LLM’s behavior, keeping the process fully differentiable. At inference time, real LLM calls can be toggled on/off per request, enabling the cost‑aware trade‑off.
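The following PyTorch sketch illustrates the fusion step described above: a pooled transformer embedding is concatenated with per‑class LLM scores and passed through a shallow FusionMLP. Layer sizes, the Yes/No parsing, and the dummy inputs are illustrative assumptions; only the overall structure follows the paper.

```python
# Minimal sketch of the feature fusion + FusionMLP step. Shapes, hidden sizes,
# and the Yes/No parsing are assumptions; the paper specifies the overall
# design (pooled embedding + per-class LLM scores -> shallow MLP), not these values.
import torch
import torch.nn as nn

NUM_CLASSES = 4          # e.g., AG News
EMBED_DIM = 768          # RoBERTa-base pooled embedding size

def parse_llm_scores(yes_no_answers):
    """Map per-class Yes/No answers from the structured prompts to a score vector."""
    return torch.tensor([1.0 if a.strip().lower().startswith("yes") else 0.0
                         for a in yes_no_answers])

class FusionMLP(nn.Module):
    """Shallow MLP that learns to weight transformer and LLM signals."""
    def __init__(self, embed_dim=EMBED_DIM, num_classes=NUM_CLASSES, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, pooled_embedding, llm_scores):
        fused = torch.cat([pooled_embedding, llm_scores], dim=-1)  # feature fusion
        return self.net(fused)                                     # class logits

# Example forward pass with dummy inputs standing in for the real backbone and LLM.
pooled = torch.randn(1, EMBED_DIM)                            # transformer output
scores = parse_llm_scores(["No", "Yes", "No", "No"]).unsqueeze(0)
logits = FusionMLP()(pooled, scores)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))       # joint training signal
```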
Results & Findings
| Dataset | Baseline RoBERTa | Baseline LLM (zero‑shot) | LabelFusion (full) |
|---|---|---|---|
| AG News (4‑class) | 90.1 % | 88.5 % | 92.4 % |
| Reuters‑21578 (10‑class) | 90.7 % | 89.2 % | 92.3 % |
- Robustness: LabelFusion maintains performance when individual components degrade (e.g., when the LLM is throttled or the transformer is under‑trained).
- Latency/Cost trade‑off: In “fast” mode (transformer only), accuracy drops by only ~1 % while latency roughly halves and API cost drops to zero.
- Ablation: Removing the LLM scores reduces accuracy by ~1.8 %; removing the transformer embeddings reduces it by ~2.2 %, confirming complementary strengths.
Practical Implications
- Plug‑and‑play for production – Developers can replace a single‑model classifier with `AutoFusionClassifier` and immediately gain a measurable boost without redesigning data pipelines.
- Dynamic cost control – SaaS platforms can expose a “budget” knob that decides whether to invoke the LLM for each request, enabling per‑request cost optimization (a routing sketch follows this list).
- Multi‑label extensions – The same fusion logic works for multi‑label tasks (e.g., tagging news articles with multiple topics), making it suitable for recommendation engines and content moderation.
- Domain adaptation – Because the LLM brings world knowledge, the fused model adapts faster to emerging vocabularies (e.g., new tech terms) without extensive re‑training of the transformer.
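A minimal sketch of such a budget knob is shown below, assuming the hypothetical `mode` argument from the earlier usage example and an illustrative per‑call cost; this routing helper is not an API defined in the paper.

```python
# Hypothetical per-request budget knob illustrating the "dynamic cost control"
# idea above. The threshold, per-call cost, and clf.predict(..., mode=...)
# interface are assumptions for illustration.
def classify_with_budget(clf, text, remaining_budget_usd, llm_call_cost_usd=0.002):
    """Invoke the LLM-fused path only while the request budget allows it."""
    if remaining_budget_usd >= llm_call_cost_usd:
        label = clf.predict([text], mode="high_accuracy")[0]   # transformer + LLM
        remaining_budget_usd -= llm_call_cost_usd
    else:
        label = clf.predict([text], mode="fast")[0]            # transformer only
    return label, remaining_budget_usd
```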
Limitations & Future Work
- Prompt engineering overhead – Crafting high‑quality per‑class prompts still requires manual effort; automated prompt generation is an open research direction.
- LLM latency variability – Real‑time LLM calls can be unpredictable, especially under heavy load; the paper suggests caching strategies but does not evaluate them extensively (a simple caching sketch follows this list).
- Scalability to hundreds of classes – Concatenating a score per class may become memory‑intensive for very large label spaces; future work could explore hierarchical or sparse fusion mechanisms.
- Differentiable LLM proxy – The training proxy approximates LLM behavior; mismatches between proxy and actual LLM responses could affect final performance, a gap the authors plan to close with reinforcement‑learning fine‑tuning.
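As one illustration of the caching idea, the sketch below memoizes per‑class prompt responses so repeated requests skip the API call. The `call_llm` callable and the cache‑key scheme are assumptions for illustration, not a design evaluated in the paper.

```python
# Sketch of response caching for per-class LLM prompts: identical
# (text, class) pairs reuse the stored Yes/No answer instead of a new API call.
import hashlib

_response_cache: dict[str, str] = {}

def _cache_key(text: str, class_name: str) -> str:
    return hashlib.sha256(f"{class_name}::{text}".encode("utf-8")).hexdigest()

def cached_llm_score(text: str, class_name: str, call_llm) -> str:
    """Return a cached Yes/No answer if available, otherwise query the LLM."""
    key = _cache_key(text, class_name)
    if key not in _response_cache:
        prompt = f"Is the article about {class_name}? Answer Yes/No.\n\n{text}"
        _response_cache[key] = call_llm(prompt)
    return _response_cache[key]
```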
Authors
- Michael Schlee
- Christoph Weisser
- Timo Kivimäki
- Melchizedek Mashiku
- Benjamin Saefken
Paper Information
- arXiv ID: 2512.10793v1
- Categories: cs.CL, cs.AI
- Published: December 11, 2025