[Paper] Decentralized Autoregressive Generation
Source: arXiv - 2601.03184v1
Overview
The paper “Decentralized Autoregressive Generation” investigates how large multimodal language models—such as LLaVA and InternVL—can be trained in a decentralized fashion without sacrificing the quality of generated text. By reformulating the training objective as a Decentralized Discrete Flow Matching problem, the authors show that the same probabilistic dynamics can be achieved whether the model is trained centrally (all parameters updated together) or in a distributed, expert‑wise manner. This opens the door to more scalable, flexible training pipelines for vision‑language systems.
Key Contributions
- Decentralized Discrete Flow Matching (DDFM) objective: A novel theoretical formulation that expresses the probability‑generating velocity as a linear combination of expert flows (sub‑models); a schematic version of this combination is sketched after this list.
- Equivalence proof: Demonstrates that decentralized training yields the same generative distribution as centralized training under the DDFM framework.
- Empirical validation on multimodal LLMs: Experiments with LLaVA and InternVL‑2.5‑1B across multiple benchmarks confirm the theoretical claims.
- Practical recipe for decentralization: Provides a concrete training pipeline (fixed CLIP vision encoder + full‑parameter fine‑tuning of ViT, MLP, and LLM) that can be adopted by practitioners.
- Open‑source reference implementation: The authors release code and pretrained checkpoints, facilitating reproducibility and further research.
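The paper's exact notation is not reproduced in this summary, but the linear‑combination idea behind the DDFM objective can be written schematically as below; the symbols u_t, u_t^{(k)}, and w_k are placeholder notation, not the authors'.

```latex
% Placeholder notation (not taken from the paper):
%   u_t        - overall probability-generating velocity at time t
%   u_t^{(k)}  - velocity contributed by the k-th expert flow (sub-model)
%   w_k        - mixture weight of expert k
u_t(x) \;=\; \sum_{k=1}^{K} w_k \, u_t^{(k)}(x)
```

Read this way, the equivalence result (second bullet) states that fitting the expert flows u_t^{(k)} separately and fitting the combined velocity u_t jointly induce the same generative distribution.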
Methodology
- Flow‑based view of autoregressive generation – The authors treat token generation as a continuous‑time stochastic process whose velocity field dictates how probability mass moves across the token vocabulary over time.
- Expert decomposition – Instead of learning a monolithic velocity field, they split it into several expert flows (e.g., vision encoder, language model, multimodal adapter). Each expert contributes a weighted component to the overall velocity.
- Decentralized Discrete Flow Matching (DDFM) – The training loss aligns the combined expert velocity with the true data velocity, using a discrete version of flow matching that works directly on token sequences (a minimal code sketch follows this list).
- Training regimes compared
  - Centralized: All parameters are updated jointly in a single optimization loop.
  - Decentralized: Each expert is trained on its own data shard or device, and the weighted sum of their velocities is matched to the target.
- Benchmarks – The authors evaluate on standard vision‑language tasks (image captioning, visual question answering, instruction following) to compare perplexity, BLEU/ROUGE scores, and human‑rated coherence.
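To make the methodology concrete, here is a minimal PyTorch sketch of the general pattern described above: each expert produces a per‑token velocity (simplified here to categorical logits), the velocities are combined linearly, and the combination is matched against target tokens. This is an illustration under simplifying assumptions, not the authors' implementation: the names (ExpertVelocity, ddfm_style_loss, etc.) are hypothetical, and a plain cross‑entropy stands in for the actual discrete flow‑matching loss.

```python
# Illustrative sketch only; not the authors' implementation.
# Names (ExpertVelocity, ddfm_style_loss, ...) are hypothetical, and plain
# cross-entropy stands in for the actual discrete flow-matching loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000  # toy vocabulary size
DIM = 64      # toy hidden size

class ExpertVelocity(nn.Module):
    """One expert flow: maps hidden states to per-token velocity logits."""
    def __init__(self, dim: int = DIM, vocab: int = VOCAB):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)  # (batch, seq, vocab)

def combined_velocity(experts, weights, h):
    """Linear combination of expert velocities (the DDFM-style mixture)."""
    w = torch.softmax(weights, dim=0)                  # keep mixture weights positive and normalized
    vs = torch.stack([e(h) for e in experts], dim=0)   # (n_experts, batch, seq, vocab)
    return torch.einsum("e,ebsv->bsv", w, vs)

def ddfm_style_loss(experts, weights, h, target_tokens):
    """Match the combined velocity to the data (cross-entropy as a stand-in)."""
    v = combined_velocity(experts, weights, h)
    return F.cross_entropy(v.reshape(-1, VOCAB), target_tokens.reshape(-1))

# Toy setup: two experts, e.g. a vision-conditioned flow and a language-only flow.
experts = nn.ModuleList([ExpertVelocity(), ExpertVelocity()])
weights = nn.Parameter(torch.zeros(len(experts)))
h = torch.randn(2, 8, DIM)                 # stand-in for multimodal hidden states
tokens = torch.randint(0, VOCAB, (2, 8))   # target token sequence

# Centralized regime: one optimizer updates all experts (and weights) jointly.
central_opt = torch.optim.AdamW(list(experts.parameters()) + [weights], lr=1e-4)
central_opt.zero_grad()
ddfm_style_loss(experts, weights, h, tokens).backward()
central_opt.step()

# Decentralized regime (schematic): each expert keeps its own optimizer and
# could be trained on its own device or data shard; only the combined
# velocity has to match the target.
local_opts = [torch.optim.AdamW(e.parameters(), lr=1e-4) for e in experts]
for opt in local_opts:
    opt.zero_grad()
ddfm_style_loss(experts, weights, h, tokens).backward()
for opt in local_opts:
    opt.step()
```

In a full discrete flow‑matching setup, the velocity would be a time‑dependent rate over token states rather than a single logit vector; the parts relevant to the paper's claim are the linear mixture and the fact that the centralized regime uses one joint optimizer while the decentralized regime can keep a separate optimizer (and data shard) per expert.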
Results & Findings
| Model / Setting | Perplexity ↓ | BLEU ↑ | Human Rating (1‑5) |
|---|---|---|---|
| LLaVA (central) | 12.4 | 28.7 | 4.2 |
| LLaVA (decentral) | 12.3 | 29.1 | 4.3 |
| InternVL‑2.5‑1B (central) | 11.8 | 30.2 | 4.5 |
| InternVL‑2.5‑1B (decentral) | 11.9 | 30.0 | 4.4 |
- Statistical parity: Decentralized training stays on par with centralized baselines, slightly exceeding them for LLaVA and trailing by a narrow margin for InternVL‑2.5‑1B across all metrics.
- Training efficiency: Decentralized runs achieve ~1.6× speed‑up on multi‑GPU clusters due to reduced synchronization overhead.
- Scalability: The approach remains stable when scaling the number of experts from 2 up to 8, suggesting it can handle even larger multimodal pipelines.
Practical Implications
- Lower infrastructure cost: Teams can train massive vision‑language models on commodity GPU clusters without a heavyweight parameter server, reducing cloud spend.
- Modular development: Developers can swap or upgrade individual experts (e.g., replace the CLIP encoder) without retraining the entire system, accelerating product iteration (see the sketch after this list).
- Edge‑to‑cloud collaboration: Parts of the model can be fine‑tuned on-device (e.g., a lightweight vision encoder) while the language backbone stays in the cloud, enabling privacy‑preserving applications.
- Faster experimentation: Decentralized pipelines allow parallel hyper‑parameter searches across experts, shortening the research‑to‑deployment cycle.
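As a concrete illustration of the modular‑development and edge‑to‑cloud points, the sketch below freezes one stand‑in component, fine‑tunes another, and then swaps the fine‑tuned module out. The toy modules and names are hypothetical placeholders, not the paper's released code or the real CLIP/LLM components.

```python
# Illustrative sketch only; toy modules stand in for the real CLIP encoder / LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisionEncoder(nn.Module):
    """Stand-in for a vision expert (e.g. a CLIP ViT)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches)

class TinyLanguageBackbone(nn.Module):
    """Stand-in for the language expert."""
    def __init__(self, dim: int = 64, vocab: int = 1000):
        super().__init__()
        self.head = nn.Linear(dim, vocab)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h)

vision = TinyVisionEncoder()
language = TinyLanguageBackbone()

# Edge-to-cloud style split: freeze the language backbone, fine-tune only the
# vision expert (e.g. on-device), and later swap in an upgraded encoder
# without touching the rest of the pipeline.
for p in language.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(vision.parameters(), lr=1e-4)

patches = torch.randn(4, 3 * 16 * 16)
targets = torch.randint(0, 1000, (4,))

logits = language(vision(patches))
loss = F.cross_entropy(logits, targets)
loss.backward()            # gradients flow only into the vision expert
opt.step()

# Modular swap: replace the vision expert; the frozen backbone is reused as-is.
vision = TinyVisionEncoder(dim=64)
```

The same freeze‑and‑train pattern (requires_grad_(False) on the fixed expert, a local optimizer on the trainable one) is what a per‑expert, per‑device setup would rely on in practice.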
Limitations & Future Work
- Assumption of linear expert combination: The current DDFM formulation relies on a linear weighting of expert flows, which may limit expressiveness for highly non‑linear interactions.
- Fixed vision encoder: Experiments keep the CLIP encoder frozen; exploring joint fine‑tuning of all components could yield further gains.
- Benchmark diversity: While the paper covers several standard tasks, real‑world deployment scenarios (e.g., video‑language, interactive agents) remain untested.
- Future directions: Extending DDFM to hierarchical expert structures, incorporating reinforcement learning signals for instruction following, and evaluating on larger‑scale models (≥10 B parameters).
Authors
- Stepan Maschan
- Haoxuan Qu
- Jun Liu
Paper Information
- arXiv ID: 2601.03184v1
- Categories: cs.LG, cs.AI
- Published: January 6, 2026