[Paper] Decentralized Autoregressive Generation
Source: arXiv - 2601.03184v1
Overview
The paper “Decentralized Autoregressive Generation” investigates how large multimodal language models—such as LLaVA and InternVL—can be trained in a decentralized fashion without sacrificing the quality of generated text. By reformulating the training objective as a Decentralized Discrete Flow Matching problem, the authors show that the same probabilistic dynamics can be achieved whether the model is trained centrally (all parameters updated together) or in a distributed, expert‑wise manner. This opens the door to more scalable, flexible training pipelines for vision‑language systems.
Key Contributions
- Decentralized Discrete Flow Matching (DDFM) objective: A novel theoretical formulation that expresses the probability‑generating velocity as a linear combination of expert flows (sub‑models); a schematic version of this combination is sketched after this list.
- Equivalence proof: Demonstrates that decentralized training yields the same generative distribution as centralized training under the DDFM framework.
- Empirical validation on multimodal LLMs: Experiments with LLaVA and InternVL‑2.5‑1B across multiple benchmarks confirm the theoretical claims.
- Practical recipe for decentralization: Provides a concrete training pipeline (fixed CLIP vision encoder + full‑parameter fine‑tuning of ViT, MLP, and LLM) that can be adopted by practitioners.
- Open‑source reference implementation: The authors release code and pretrained checkpoints, facilitating reproducibility and further research.
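The paper's exact notation is not reproduced in this summary, but the linear‑combination idea behind the DDFM objective can be written schematically as below; the symbols u_t, u_t^{(k)}, and w_k are placeholder notation, not the authors'.

```latex
% Placeholder notation (not taken from the paper):
%   u_t        - overall probability-generating velocity at time t
%   u_t^{(k)}  - velocity contributed by the k-th expert flow (sub-model)
%   w_k        - mixture weight of expert k
u_t(x) \;=\; \sum_{k=1}^{K} w_k \, u_t^{(k)}(x)
```

Read this way, the equivalence result (second bullet) states that fitting the expert flows u_t^{(k)} separately and fitting the combined velocity u_t jointly induce the same generative distribution.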
Methodology
- Flow‑based view of autoregressive generation – The authors treat token generation as a continuous‑time stochastic process whose velocity field dictates how probability mass moves across the token vocabulary over time.
- Expert decomposition – Instead of learning a monolithic velocity field, they split it into several expert flows (e.g., vision encoder, language model, multimodal adapter). Each expert contributes a weighted component to the overall velocity.
- Decentralized Discrete Flow Matching (DDFM) – The training loss aligns the combined expert velocity with the true data velocity, using a discrete version of flow matching that works directly on token sequences (a minimal code sketch follows this list).
- Training regimes compared
  - Centralized: All parameters are updated jointly in a single optimization loop.
  - Decentralized: Each expert is trained on its own data shard or device, and the weighted sum of their velocities is matched to the target.
- Benchmarks – The authors evaluate on standard vision‑language tasks (image captioning, visual question answering, instruction following) to compare perplexity, BLEU/ROUGE scores, and human‑rated coherence.
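To make the methodology concrete, here is a minimal PyTorch sketch of the general pattern described above: each expert produces a per‑token velocity (simplified here to categorical logits), the velocities are combined linearly, and the combination is matched against target tokens. This is an illustration under simplifying assumptions, not the authors' implementation: the names (ExpertVelocity, ddfm_style_loss, etc.) are hypothetical, and a plain cross‑entropy stands in for the actual discrete flow‑matching loss.

```python
# Illustrative sketch only; not the authors' implementation.
# Names (ExpertVelocity, ddfm_style_loss, ...) are hypothetical, and plain
# cross-entropy stands in for the actual discrete flow-matching loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000  # toy vocabulary size
DIM = 64      # toy hidden size

class ExpertVelocity(nn.Module):
    """One expert flow: maps hidden states to per-token velocity logits."""
    def __init__(self, dim: int = DIM, vocab: int = VOCAB):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)  # (batch, seq, vocab)

def combined_velocity(experts, weights, h):
    """Linear combination of expert velocities (the DDFM-style mixture)."""
    w = torch.softmax(weights, dim=0)                  # keep mixture weights positive and normalized
    vs = torch.stack([e(h) for e in experts], dim=0)   # (n_experts, batch, seq, vocab)
    return torch.einsum("e,ebsv->bsv", w, vs)

def ddfm_style_loss(experts, weights, h, target_tokens):
    """Match the combined velocity to the data (cross-entropy as a stand-in)."""
    v = combined_velocity(experts, weights, h)
    return F.cross_entropy(v.reshape(-1, VOCAB), target_tokens.reshape(-1))

# Toy setup: two experts, e.g. a vision-conditioned flow and a language-only flow.
experts = nn.ModuleList([ExpertVelocity(), ExpertVelocity()])
weights = nn.Parameter(torch.zeros(len(experts)))
h = torch.randn(2, 8, DIM)                 # stand-in for multimodal hidden states
tokens = torch.randint(0, VOCAB, (2, 8))   # target token sequence

# Centralized regime: one optimizer updates all experts (and weights) jointly.
central_opt = torch.optim.AdamW(list(experts.parameters()) + [weights], lr=1e-4)
central_opt.zero_grad()
ddfm_style_loss(experts, weights, h, tokens).backward()
central_opt.step()

# Decentralized regime (schematic): each expert keeps its own optimizer and
# could be trained on its own device or data shard; only the combined
# velocity has to match the target.
local_opts = [torch.optim.AdamW(e.parameters(), lr=1e-4) for e in experts]
for opt in local_opts:
    opt.zero_grad()
ddfm_style_loss(experts, weights, h, tokens).backward()
for opt in local_opts:
    opt.step()
```

In a full discrete flow‑matching setup, the velocity would be a time‑dependent rate over token states rather than a single logit vector; the parts relevant to the paper's claim are the linear mixture and the fact that the centralized regime uses one joint optimizer while the decentralized regime can keep a separate optimizer (and data shard) per expert.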
Results & Findings
| Model / Setting | Perplexity ↓ | BLEU ↑ | Human Rating (1‑5) |
|---|---|---|---|
| LLaVA (central) | 12.4 | 28.7 | 4.2 |
| LLaVA (decentral) | 12.3 | 29.1 | 4.3 |
| InternVL‑2.5‑1B (central) | 11.8 | 30.2 | 4.5 |
| InternVL‑2.5‑1B (decentral) | 11.9 | 30.0 | 4.4 |
- Statistical parity: Decentralized training stays on par with centralized baselines, slightly exceeding them for LLaVA and trailing by a narrow margin for InternVL‑2.5‑1B across all metrics.
- Training efficiency: Decentralized runs achieve ~1.6× speed‑up on multi‑GPU clusters due to reduced synchronization overhead.
- Scalability: The approach remains stable when scaling the number of experts from 2 up to 8, suggesting it can handle even larger multimodal pipelines.
Practical Implications
- Lower infrastructure cost: Teams can train massive vision‑language models on commodity GPU clusters without a heavyweight parameter server, reducing cloud spend.
- Modular development: Developers can swap or upgrade individual experts (e.g., replace the CLIP encoder) without retraining the entire system, accelerating product iteration (see the sketch after this list).
- Edge‑to‑cloud collaboration: Parts of the model can be fine‑tuned on-device (e.g., a lightweight vision encoder) while the language backbone stays in the cloud, enabling privacy‑preserving applications.
- Faster experimentation: Decentralized pipelines allow parallel hyper‑parameter searches across experts, shortening the research‑to‑deployment cycle.
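As a concrete illustration of the modular‑development and edge‑to‑cloud points, the sketch below freezes one stand‑in component, fine‑tunes another, and then swaps the fine‑tuned module out. The toy modules and names are hypothetical placeholders, not the paper's released code or the real CLIP/LLM components.

```python
# Illustrative sketch only; toy modules stand in for the real CLIP encoder / LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisionEncoder(nn.Module):
    """Stand-in for a vision expert (e.g. a CLIP ViT)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches)

class TinyLanguageBackbone(nn.Module):
    """Stand-in for the language expert."""
    def __init__(self, dim: int = 64, vocab: int = 1000):
        super().__init__()
        self.head = nn.Linear(dim, vocab)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h)

vision = TinyVisionEncoder()
language = TinyLanguageBackbone()

# Edge-to-cloud style split: freeze the language backbone, fine-tune only the
# vision expert (e.g. on-device), and later swap in an upgraded encoder
# without touching the rest of the pipeline.
for p in language.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(vision.parameters(), lr=1e-4)

patches = torch.randn(4, 3 * 16 * 16)
targets = torch.randint(0, 1000, (4,))

logits = language(vision(patches))
loss = F.cross_entropy(logits, targets)
loss.backward()            # gradients flow only into the vision expert
opt.step()

# Modular swap: replace the vision expert; the frozen backbone is reused as-is.
vision = TinyVisionEncoder(dim=64)
```

The same freeze‑and‑train pattern (requires_grad_(False) on the fixed expert, a local optimizer on the trainable one) is what a per‑expert, per‑device setup would rely on in practice.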
Limitations & Future Work
- Assumption of linear expert combination: The current DDFM formulation relies on a linear weighting of expert flows, which may limit expressiveness for highly non‑linear interactions.
- Fixed vision encoder: Experiments keep the CLIP encoder frozen; exploring joint fine‑tuning of all components could yield further gains.
- Benchmark diversity: While the paper covers several standard tasks, real‑world deployment scenarios (e.g., video‑language, interactive agents) remain untested.
- Future directions: Extending DDFM to hierarchical expert structures, incorporating reinforcement learning signals for instruction following, and evaluating on larger‑scale models (≥10 B parameters).
Authors
- Stepan Maschan
- Haoxuan Qu
- Jun Liu
Paper Information
- arXiv ID: 2601.03184v1
- Categories: cs.LG, cs.AI
- Published: January 6, 2026