[Paper] VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Published: April 21, 2026
4 min read
Source: arXiv 2604.19728v1

Overview

The paper introduces VLA Foundry, an open-source toolkit that unifies large language model (LLM), vision-language model (VLM), and vision-language-action model (VLA) training within a single pipeline. By eliminating the glue-code overhead that usually separates pre-training from action fine-tuning, the framework lets researchers and engineers build end-to-end embodied agents from scratch, or by plugging in popular pretrained backbones, while keeping the whole stack reproducible and extensible.

Key Contributions

  • Unified training stack handling LLM pre‑training, VLM pre‑training, and VLA fine‑tuning in one codebase.
  • Support for both from‑scratch and pretrained backbones (e.g., Qwen3‑VL) via a simple Hugging Face interface.
  • Two released model families:
    1. A fully‑from‑scratch LLM → VLM → VLA pipeline that matches the authors’ prior closed‑source results.
    2. A Qwen3‑VL‑based VLA that achieves a large boost on multi‑task tabletop manipulation.
  • Open‑source evaluation suite (LBM Eval) and improved simulator/STEP analysis tools for easy benchmarking.
  • Public release of code, model weights, and demo videos, lowering the entry barrier for the community.

Methodology

VLA Foundry treats the three stages of embodied AI as modular components:

  1. Language Pre‑training (LLM) – Standard causal or encoder‑decoder transformers are trained on large text corpora, optionally using existing checkpoints from Hugging Face.
  2. Vision‑Language Pre‑training (VLM) – A multimodal encoder aligns image patches with token embeddings, leveraging contrastive or image‑text matching objectives.
  3. Vision‑Language‑Action Fine‑tuning (VLA) – The fused LLM‑VLM model is extended with a policy head that predicts low‑level robot actions (e.g., end‑effector poses). Training uses reinforcement‑style trajectories generated in the LBM Eval simulator, applying behavior cloning and RL‑style loss terms.
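The paper does not include code, but the behavior-cloning term in stage 3 can be pictured with a toy sketch. Everything below is hypothetical and kept in pure Python for clarity; the actual training loss likely runs on tensors and also includes the RL-style terms mentioned above.

```python
# Toy sketch of a behavior-cloning loss over a trajectory batch.
# All names are hypothetical; the framework's real loss combines this
# kind of imitation term with RL-style objectives on tensors.

def bc_loss(predicted_actions, expert_actions):
    """Mean squared error between predicted and expert end-effector actions."""
    assert len(predicted_actions) == len(expert_actions)
    total = 0.0
    count = 0
    for pred, expert in zip(predicted_actions, expert_actions):
        for p, e in zip(pred, expert):  # per-dimension squared error
            total += (p - e) ** 2
            count += 1
    return total / count

# Example: 7-DoF end-effector actions over a 2-step trajectory.
pred   = [[0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0],
          [0.2, 0.1, 0.2, 0.0, 0.0, 0.0, 0.0]]
expert = [[0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0],
          [0.2, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0]]
print(bc_loss(pred, expert))  # small loss: one action dimension differs
```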

All three stages share a common data loader, tokenizer, and checkpoint handling logic, so swapping a component (e.g., swapping a pretrained Qwen3‑VL encoder for a custom one) requires only a few config changes. The pipeline is orchestrated via Hydra/YAML configs, and the codebase is built on PyTorch + Accelerate for multi‑GPU scaling.
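The "few config changes" workflow can be pictured with a Hydra-style override merge. The keys and values below are invented for illustration; they are not VLA Foundry's actual config schema.

```python
# Sketch of Hydra-style config overriding: a base config plus a few
# dotted-key overrides is enough to swap out a component.
# All keys and values are hypothetical, not the framework's real schema.

def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' style overrides to a nested dict config."""
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        node = config
        *path, leaf = dotted_key.split(".")
        for key in path:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config

base = {
    "model": {"backbone": "from_scratch_vlm", "policy_head": "mlp"},
    "train": {"stage": "vla_finetune", "lr": "1e-4"},
}

# Swapping in a pretrained Qwen3-VL encoder becomes a one-line override.
cfg = apply_overrides(base, ["model.backbone=Qwen/Qwen3-VL"])
print(cfg["model"]["backbone"])  # Qwen/Qwen3-VL
```

Hydra resolves exactly this kind of dotted override from the command line, which is why a backbone swap does not require touching the training code.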

Results & Findings

| Model | Training Regime | Success Rate on LBM Eval (avg. across tasks) |
|---|---|---|
| From-scratch LLM → VLM → VLA | End-to-end from zero | ≈ 78 % (on par with the authors' previous closed-source system) |
| Qwen3-VL-backboned VLA | Pretrained vision-language encoder + policy fine-tune | ≈ 92 % (significant margin over baseline) |
  • The from‑scratch pipeline demonstrates that a fully open stack can reach competitive performance without any proprietary components.
  • Leveraging a strong pretrained vision‑language backbone (Qwen3‑VL) yields a large boost in multi‑task tabletop manipulation, confirming the value of transfer learning for embodied policies.
  • Qualitative videos show smooth, closed‑loop interaction (e.g., picking up objects, stacking blocks) despite the models being trained on a relatively modest simulated dataset.

Practical Implications

  • Rapid prototyping: Developers can spin up a new VLA agent by picking a pretrained LLM/VLM from Hugging Face, tweaking a few config flags, and launching the fine‑tuning job—no need to stitch together separate repositories.
  • Lower compute barrier: The from‑scratch pipeline runs on commodity multi‑GPU rigs, enabling research labs and startups to experiment without massive TPU clusters.
  • Standardized benchmarking: By bundling LBM Eval and STEP analysis tools, teams can objectively compare policies, facilitating reproducible research and easier CI testing for embodied AI products.
  • Transfer to real robots: The modular policy head can be swapped for a robot‑specific controller (e.g., ROS2 action server), opening a straightforward path from simulation to hardware deployment.
  • Community growth: Open weights and a well‑documented codebase invite contributions—new tasks, data augmentations, or custom simulators can be plugged in with minimal friction.
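Plug-in extensibility of this kind is commonly implemented with a component registry, where new backbones, tasks, or simulators register under a string key and are looked up from config. A minimal sketch (all names invented, not VLA Foundry's API):

```python
# Minimal component-registry sketch: components register under a string
# key and are constructed by name from config, so contributors can add
# new backbones without editing core code. All names are hypothetical.

BACKBONE_REGISTRY = {}

def register_backbone(name):
    """Decorator that adds a backbone factory to the registry."""
    def wrap(factory):
        BACKBONE_REGISTRY[name] = factory
        return factory
    return wrap

@register_backbone("from_scratch_vlm")
def build_from_scratch():
    return "FromScratchVLM()"  # placeholder for a real model object

@register_backbone("qwen3_vl")
def build_qwen3_vl():
    return "Qwen3VLBackbone()"  # placeholder for a pretrained backbone

def build_backbone(cfg):
    # The config names the component; the registry supplies the factory.
    return BACKBONE_REGISTRY[cfg["backbone"]]()

print(build_backbone({"backbone": "qwen3_vl"}))
```

A new simulator or task would follow the same pattern with its own registry, which is what keeps the contribution friction low.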

Limitations & Future Work

  • Simulation‑only evaluation: All experiments are confined to the LBM Eval simulator; real‑world transfer remains untested.
  • Task scope: The benchmark focuses on tabletop manipulation; extending to navigation, long‑horizon planning, or multi‑agent scenarios may expose scaling challenges.
  • Compute cost for large backbones: While the framework supports from‑scratch training, fine‑tuning massive models like Qwen3‑VL still demands high‑end GPUs and careful memory management.
  • Future directions proposed by the authors include: integrating real‑world robot data pipelines, adding curriculum learning for progressively harder tasks, and expanding the framework to support multimodal feedback (e.g., haptic or audio).

Authors

  • Jean Mercat
  • Sedrick Keh
  • Kushal Arora
  • Isabella Huang
  • Paarth Shah
  • Haruki Nishimura
  • Shun Iwase
  • Katherine Liu

Paper Information

  • arXiv ID: 2604.19728v1
  • Categories: cs.RO, cs.AI, cs.CV, cs.LG, cs.SE
  • Published: April 21, 2026