[Paper] Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning

Published: February 3, 2026 at 01:18 PM EST
4 min read
Source: arXiv


Overview

Training multimodal large language models (MLLMs) that can understand both text and images is notoriously slow because the models are huge and they have to process thousands of visual tokens per image. This paper introduces DualSpeed, a “fast‑slow” training framework that dramatically cuts training time by pruning visual tokens during the bulk of training while still guaranteeing that the final model works well on full‑resolution images.

Key Contributions

  • Dual‑mode training architecture: a fast‑mode that uses existing visual‑token‑pruning (VTP) techniques to speed up most of the training, and a slow‑mode that periodically trains on the full set of visual tokens to keep the model’s behavior consistent at inference time.
  • Mode isolator: a lightweight mechanism that prevents the fast‑mode’s pruning‑specific parameters from contaminating the slow‑mode, ensuring clean separation of the two learning streams.
  • Self‑distillation bridge: the slow‑mode receives distilled knowledge from the fast‑mode, allowing it to converge quickly despite being trained on far fewer steps.
  • Plug‑and‑play design: DualSpeed works with any off‑the‑shelf VTP method (e.g., TokenLearner, DynamicViT) without modifying the underlying MLLM architecture.
  • Empirical gains: on LLaVA‑1.5 the training speedup is 2.1×, and on the larger LLaVA‑NeXT it reaches 4.0×, while preserving ≥ 99 % of the original performance on standard multimodal benchmarks.
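
To make the pruning idea concrete, here is a minimal, framework‑free sketch of score‑based visual token pruning. The scoring function and `keep_ratio` are illustrative assumptions, not the paper's specific method (DualSpeed delegates pruning to an off‑the‑shelf VTP plugin):

```python
import math

def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the top-scoring fraction of visual tokens.

    tokens: sequence of visual token embeddings (any objects)
    scores: per-token importance scores (e.g. attention a token receives);
            how scores are computed is up to the VTP plugin
    """
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Pick the k highest-scoring indices, then restore the original order
    # so the language model still sees tokens in spatial sequence.
    kept = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:k])
    return [tokens[i] for i in kept]

# 8 tokens, keep 25% -> the 2 highest-scoring tokens survive, order preserved
pruned = prune_visual_tokens(list(range(8)),
                             [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
```

Real VTP methods such as TokenLearner or DynamicViT learn the scoring (or even merge tokens) rather than using a fixed heuristic, but the interface, dense tokens in, a small ordered subset out, is the same.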

Methodology

  1. Fast‑mode (primary)

    • The image encoder first generates a dense set of visual tokens.
    • A VTP plugin (any token‑pruning algorithm) selects a small subset (e.g., 25 % of tokens) and discards the rest.
    • The language model receives this trimmed token sequence, dramatically reducing memory bandwidth and compute.
  2. Mode isolator

    • Two separate parameter “views” are maintained: one for the fast‑mode and one for the slow‑mode.
    • Gradient updates from the fast‑mode are masked so they do not affect the slow‑mode's weights, and vice versa.
  3. Slow‑mode (auxiliary)

    • Periodically (e.g., every N steps) the same batch is processed without token pruning, feeding the full token set to the model.
    • This step enforces that the model learns to handle the complete visual context, eliminating the train‑inference mismatch.
  4. Self‑distillation

    • The fast‑mode, which has already learned a strong multimodal representation on many more updates, acts as a teacher.
    • The slow‑mode’s logits are encouraged (via KL‑divergence loss) to match the fast‑mode’s predictions, accelerating convergence despite fewer slow‑mode steps.
  5. Training schedule

    • The majority of iterations run in fast‑mode (≈ 80‑90 %).
    • Slow‑mode runs intermittently, and its loss is combined with the distillation loss.
    • No extra inference‑time overhead is introduced because the final model uses the full visual token stream.
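
The self‑distillation step (step 4) combines the slow‑mode task loss with a KL term pulling slow‑mode predictions toward the fast‑mode teacher. A minimal sketch, assuming a per‑token KL over logits and a hypothetical weight `alpha` (the paper's exact loss formulation is not reproduced here):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between two distributions over one token's vocabulary."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def slow_mode_loss(task_loss, teacher_logits, student_logits, alpha=0.5):
    """Slow-mode objective: task loss plus a distillation term from the fast-mode teacher.

    alpha is a hypothetical weighting; the paper does not specify its value here.
    """
    return task_loss + alpha * kl_divergence(teacher_logits, student_logits)
```

When the slow‑mode's logits match the teacher's, the KL term vanishes and only the task loss remains, so distillation acts purely as a convergence accelerator rather than a competing objective.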

Results & Findings

| Model | Baseline training time | DualSpeed training time | Speed‑up | Performance (relative) |
|---|---|---|---|---|
| LLaVA‑1.5 | 100 h | 48 h | 2.1× | 99.3 % (on VQAv2, COCO Caption) |
| LLaVA‑NeXT | 200 h | 50 h | 4.0× | 99.1 % (on MMBench, ScienceQA) |
  • Accuracy drop is less than 1 % across all evaluated tasks, confirming that the slow‑mode successfully bridges the train‑inference gap.
  • Ablation studies show that removing the mode isolator or the self‑distillation component reduces speed‑up benefits and leads to a 3‑5 % performance dip, highlighting their necessity.
  • The framework works with multiple VTP plugins (TokenLearner, DynamicViT, etc.) with comparable gains, proving its plug‑and‑play nature.

Practical Implications

  • Faster prototyping: Teams can iterate on new multimodal architectures in roughly half the time (or less), cutting cloud compute costs dramatically.
  • Scalable training: Large‑scale MLLMs that previously required weeks on multi‑GPU clusters become feasible on smaller clusters, opening the door for startups and research labs with limited resources.
  • Energy savings: Reducing the number of processed visual tokens cuts GPU memory traffic and power consumption, aligning with sustainability goals.
  • Seamless deployment: Because the final model is trained on full visual sequences, there is no runtime penalty—the model can be used exactly like any existing MLLM.
  • Compatibility: Existing pipelines that already use VTP for inference can adopt DualSpeed with minimal code changes (just wrap the training loop with the fast‑slow scheduler).
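
As a sketch of what "wrapping the training loop" might look like, the following hypothetical scheduler interleaves fast and slow steps at roughly the paper's 80‑90 % fast ratio; the callback names and round‑robin interleaving policy are assumptions, not the paper's API:

```python
def fast_slow_scheduler(train_step_fast, train_step_slow, num_steps,
                        fast_fraction=0.85):
    """Run most steps in fast (pruned) mode, periodically switching to slow mode.

    fast_fraction mirrors the paper's ~80-90% fast-mode share; everything
    else here is an illustrative assumption.
    """
    slow_every = max(2, round(1 / (1 - fast_fraction)))  # e.g. every 7th step is slow
    modes = []
    for step in range(num_steps):
        if step % slow_every == slow_every - 1:
            train_step_slow(step)   # full visual token stream, no pruning
            modes.append("slow")
        else:
            train_step_fast(step)   # pruned tokens via the VTP plugin
            modes.append("fast")
    return modes
```

With the default `fast_fraction=0.85`, roughly one step in seven runs on the full token set, which matches the 80‑90 % fast‑mode share described in the training schedule above.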

Limitations & Future Work

  • Token‑pruning dependency: The speed‑up magnitude hinges on how aggressively tokens can be pruned without harming representation quality; extremely fine‑grained visual tasks may limit pruning ratios.
  • Scheduling heuristics: The paper uses a fixed fast‑slow ratio; adaptive schedules that react to loss convergence could yield further gains but are not explored.
  • Generalization to other modalities: While the method works for vision‑language models, extending the fast‑slow paradigm to audio or video tokens remains an open question.
  • Distillation overhead: The self‑distillation loss adds extra forward passes, which, although minor compared to the full training cost, could be optimized further.

Overall, DualSpeed offers a pragmatic recipe for slashing MLLM training time without sacrificing the quality that developers and product teams rely on. The open‑source implementation (GitHub link in the paper) makes it easy to try out on your own multimodal projects.

Authors

  • Dingkun Zhang
  • Shuhan Qi
  • Yulin Wu
  • Xinyu Xiao
  • Xuan Wang
  • Long Chen

Paper Information

  • arXiv ID: 2602.03815v1
  • Categories: cs.CV, cs.LG
  • Published: February 3, 2026
  • PDF: Download PDF
