[Paper] Decentralized Autoregressive Generation

Published: January 6, 2026 at 12:07 PM EST
3 min read
Source: arXiv - 2601.03184v1

Overview

The paper “Decentralized Autoregressive Generation” investigates how large multimodal language models—such as LLaVA and InternVL—can be trained in a decentralized fashion without sacrificing the quality of generated text. By reformulating the training objective as a Decentralized Discrete Flow Matching problem, the authors show that the same probabilistic dynamics can be achieved whether the model is trained centrally (all parameters updated together) or in a distributed, expert‑wise manner. This opens the door to more scalable, flexible training pipelines for vision‑language systems.

Key Contributions

  • Decentralized Discrete Flow Matching (DDFM) objective: A novel theoretical formulation that expresses the probability‑generating velocity as a linear combination of expert flows (sub‑models); a schematic sketch follows this list.
  • Equivalence proof: Demonstrates that decentralized training yields the same generative distribution as centralized training under the DDFM framework.
  • Empirical validation on multimodal LLMs: Experiments with LLaVA and InternVL‑2.5‑1B across multiple benchmarks confirm the theoretical claims.
  • Practical recipe for decentralization: Provides a concrete training pipeline (fixed CLIP vision encoder + full‑parameter fine‑tuning of ViT, MLP, and LLM) that can be adopted by practitioners.
  • Open‑source reference implementation: The authors release code and pretrained checkpoints, facilitating reproducibility and further research.
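
To make the linear‑combination idea concrete, below is a minimal PyTorch sketch of how a generative velocity could be assembled as a weighted sum of expert flows. This is one reading of the DDFM decomposition, not the authors' released code; the function name (combined_velocity), tensor shapes, and uniform weights are all illustrative assumptions.

```python
import torch

def combined_velocity(expert_velocities: list[torch.Tensor],
                      weights: torch.Tensor) -> torch.Tensor:
    """Weighted linear combination of per-expert velocity fields.

    expert_velocities: K tensors of shape (batch, seq_len, vocab), each an
        expert's predicted rate of probability-mass movement over tokens.
    weights: K non-negative mixing coefficients that sum to 1.
    Names and shapes are illustrative assumptions, not the paper's API.
    """
    stacked = torch.stack(expert_velocities, dim=0)        # (K, B, S, V)
    return torch.einsum("k,kbsv->bsv", weights, stacked)   # (B, S, V)

# Toy usage: three experts (e.g., vision flow, adapter flow, language flow).
K, B, S, V = 3, 2, 16, 32000
experts = [torch.randn(B, S, V) for _ in range(K)]
w = torch.softmax(torch.zeros(K), dim=0)   # uniform weights for the sketch
velocity = combined_velocity(experts, w)   # combined generative velocity
```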

Methodology

  1. Flow‑based view of autoregressive generation – The authors treat token generation as a continuous-time stochastic process whose velocity field dictates how probability mass moves from one token to the next.
  2. Expert decomposition – Instead of learning a monolithic velocity field, they split it into several expert flows (e.g., vision encoder, language model, multimodal adapter). Each expert contributes a weighted component to the overall velocity.
  3. Decentralized Discrete Flow Matching (DDFM) – The training loss aligns the combined expert velocity with the true data velocity, using a discrete version of flow matching that works directly on token sequences (see the training‑step sketch after this list).
  4. Training regimes compared
    • Centralized: All parameters are updated jointly in a single optimization loop.
    • Decentralized: Each expert is trained on its own data shard or device, and the weighted sum of their velocities is matched to the target.
  5. Benchmarks – The authors evaluate on standard vision‑language tasks (image captioning, visual question answering, instruction following) to compare perplexity, BLEU/ROUGE scores, and human‑rated coherence.
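
The sketch below shows, under stated assumptions, what a decentralized DDFM‑style training step could look like: each expert produces its own velocity (in practice on its own device or data shard), a shared matching loss couples their weighted combination to a target velocity, and each expert's optimizer updates only its own parameters. The squared‑error loss, the nn.Linear "experts", and all shapes are placeholders, not the paper's actual objective or architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ddfm_loss(expert_velocities, weights, target_velocity):
    # Weighted sum of expert flows, matched against the data velocity.
    # MSE is a stand-in for the paper's discrete flow-matching divergence.
    combined = torch.einsum("k,kbsv->bsv", weights, torch.stack(expert_velocities))
    return F.mse_loss(combined, target_velocity)

def decentralized_step(experts, optimizers, weights, batch, target_velocity):
    # Each expert evaluates its own velocity field; only the shared
    # matching loss couples them, so updates stay expert-wise.
    velocities = [expert(batch) for expert in experts]
    loss = ddfm_loss(velocities, weights, target_velocity)
    loss.backward()                       # gradients reach every expert
    for opt in optimizers:                # expert-wise parameter updates
        opt.step()
        opt.zero_grad()
    return loss.item()

# Toy usage: three linear "experts" over a 32-dim embedding, vocab of 100.
B, S, D, V, K = 2, 8, 32, 100, 3
experts = [nn.Linear(D, V) for _ in range(K)]
optimizers = [torch.optim.SGD(e.parameters(), lr=1e-2) for e in experts]
weights = torch.full((K,), 1.0 / K)
batch = torch.randn(B, S, D)
target_velocity = torch.randn(B, S, V)
print(decentralized_step(experts, optimizers, weights, batch, target_velocity))
```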

Results & Findings

| Model / Setting | Perplexity ↓ | BLEU ↑ | Human Rating (1‑5) |
|---|---|---|---|
| LLaVA (central) | 12.4 | 28.7 | 4.2 |
| LLaVA (decentral) | 12.3 | 29.1 | 4.3 |
| InternVL‑2.5‑1B (central) | 11.8 | 30.2 | 4.5 |
| InternVL‑2.5‑1B (decentral) | 11.9 | 30.0 | 4.4 |
  • Statistical parity: Decentralized training matches or slightly exceeds centralized baselines on all metrics.
  • Training efficiency: Decentralized runs achieve ~1.6× speed‑up on multi‑GPU clusters due to reduced synchronization overhead.
  • Scalability: The approach remains stable when scaling the number of experts from 2 up to 8, suggesting it can handle even larger multimodal pipelines.

Practical Implications

  • Lower infrastructure cost: Teams can train massive vision‑language models on commodity GPU clusters without a heavyweight parameter server, reducing cloud spend.
  • Modular development: Developers can swap or upgrade individual experts (e.g., replace the CLIP encoder) without retraining the entire system, accelerating product iteration (see the sketch after this list).
  • Edge‑to‑cloud collaboration: Parts of the model can be fine‑tuned on-device (e.g., a lightweight vision encoder) while the language backbone stays in the cloud, enabling privacy‑preserving applications.
  • Faster experimentation: Decentralized pipelines allow parallel hyper‑parameter searches across experts, shortening the research‑to‑deployment cycle.
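
As a toy illustration of that modularity (hypothetical registry, stand‑in nn.Linear experts rather than the paper's actual components), an expert registry lets one flow be replaced while the others keep their trained parameters:

```python
import torch
import torch.nn as nn

# Hypothetical expert registry: each named expert maps shared inputs to a
# velocity over the vocabulary. Swapping one entry leaves the other experts'
# parameters untouched, which is the modularity argument made above.
D, V = 32, 100
experts = nn.ModuleDict({
    "vision":   nn.Linear(D, V),  # stand-in for a CLIP/ViT-based flow
    "adapter":  nn.Linear(D, V),  # stand-in for the multimodal adapter flow
    "language": nn.Linear(D, V),  # stand-in for the LLM flow
})

# Upgrade only the vision expert; the other experts keep their parameters.
experts["vision"] = nn.Linear(D, V)

x = torch.randn(2, 8, D)
weights = torch.full((len(experts),), 1.0 / len(experts))
velocity = torch.einsum("k,kbsv->bsv", weights,
                        torch.stack([e(x) for e in experts.values()]))
```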

Limitations & Future Work

  • Assumption of linear expert combination: The current DDFM formulation relies on a linear weighting of expert flows, which may limit expressiveness for highly non‑linear interactions.
  • Fixed vision encoder: Experiments keep the CLIP encoder frozen; exploring joint fine‑tuning of all components could yield further gains.
  • Benchmark diversity: While the paper covers several standard tasks, real‑world deployment scenarios (e.g., video‑language, interactive agents) remain untested.
  • Future directions: Extending DDFM to hierarchical expert structures, incorporating reinforcement learning signals for instruction following, and evaluating on larger‑scale models (≥10 B parameters).

Authors

  • Stepan Maschan
  • Haoxuan Qu
  • Jun Liu

Paper Information

  • arXiv ID: 2601.03184v1
  • Categories: cs.LG, cs.AI
  • Published: January 6, 2026