[Paper] From Pixels to Words -- Towards Native One-Vision Models at Scale

Published: 2 weeks ago (May 27, 2026 at 01:59 PM EDT)

4 min read

Source: arXiv

Source: arXiv - 2605.28820v1

Overview

The paper presents NEO‑ov, a “native” vision‑language foundation model that learns to link pixels and words across multiple frames end‑to‑end, without relying on separate image encoders, language decoders, or post‑hoc fusion modules. By removing the traditional modular pipeline, the authors show that a single unified architecture can achieve competitive (and sometimes superior) performance on fine‑grained visual tasks, opening the door to more seamless multimodal AI at scale.

Key Contributions

One‑Vision Architecture: Introduces the first large‑scale native vision‑language model (NEO‑ov) that processes raw pixels and text jointly, eliminating external encoders and adapters.
Cross‑Frame Pixel‑Word Alignment: Enables fine‑grained correspondence between image patches and words across time, supporting video and multi‑image understanding.
Competitive Performance: Demonstrates that NEO‑ov narrows the accuracy gap with state‑of‑the‑art modular VLMs while excelling in tasks requiring detailed spatial reasoning.
Systematic Architectural Analysis: Provides ablation studies and design guidelines (e.g., tokenization strategy, attention scaling) that help researchers replicate or extend native multimodal models.
Open‑Source Release: Shares training recipes, code, and pretrained checkpoints, facilitating rapid adoption by the community.

Methodology

Unified Transformer Backbone – A single transformer ingests a sequence composed of image patches (flattened pixel grids) and tokenized text. No separate CNN or vision transformer is pre‑trained; all parameters are learned jointly.
Spatiotemporal Tokenization – For video or multi‑image inputs, each frame is split into non‑overlapping patches, and a lightweight positional encoding injects both spatial and temporal information.
Cross‑Modal Attention – Standard multi‑head self‑attention operates over the combined token stream, allowing any word token to attend directly to any pixel patch, regardless of frame. This yields pixel‑word correspondence at every layer.
Training Objective – A contrastive loss (image‑text matching) together with a masked language modeling objective encourages the model to predict missing words from visual context and vice‑versa.
Scaling Strategy – Training on large‑scale image‑text datasets (e.g., LAION) using mixed‑precision and distributed data parallelism, following recipes that balance model depth, width, and batch size for optimal convergence.

Results & Findings

Benchmark	Modular VLM (baseline)	NEO‑ov (native)
Image‑Text Retrieval (MSCOCO)	78.4 R@1	79.1 R@1
Video Question Answering (MSRVTT‑QA)	44.2 %	45.6 %
Fine‑Grained Spatial Reasoning (RefCOCO)	71.3 %	73.0 %
Zero‑Shot Classification (ImageNet)	71.8 %	71.5 %

Narrowed Gap: Across most standard vision‑language tasks, NEO‑ov matches or slightly outperforms modular baselines despite its simpler pipeline.
Superior Spatial Intelligence: The model shows a clear edge on tasks that demand pixel‑level grounding (e.g., referring expression comprehension).
Scalability: Experiments scaling the model from 300 M to 2 B parameters indicate steady performance gains, confirming that the native approach scales similarly to modular counterparts.

Practical Implications

Simplified Deployment – A single model file handling both vision and language reduces inference latency and memory overhead compared to pipelines that stitch together separate encoders and decoders.
Better Video Understanding – Direct cross‑frame attention makes NEO‑ov a strong candidate for applications like video captioning, surveillance analytics, or interactive media where temporal context matters.
Fine‑Grained UI/AR – Pixel‑level grounding can power more accurate visual assistants, AR overlays, and robotics perception systems that need to map language commands to precise image regions.
Unified Fine‑Tuning – Developers can fine‑tune the same checkpoint for diverse downstream tasks (retrieval, VQA, captioning) without re‑architecting the model, accelerating product iteration cycles.

Limitations & Future Work

Training Cost – End‑to‑end native training still requires massive compute (multi‑node GPUs) and large curated datasets, which may be prohibitive for smaller labs.
Generalization to High‑Resolution Inputs – Processing very high‑resolution images or long video sequences can exceed token limits; future work could explore hierarchical tokenization or memory‑efficient attention.
Interpretability – While the model learns pixel‑word alignments, the internal attention patterns are less transparent than explicit alignment modules; tools for visualizing cross‑modal attention would be valuable.
Domain Adaptation – Adapting NEO‑ov to specialized domains (medical imaging, satellite data) may require additional domain‑specific pre‑training or curriculum strategies.

NEO‑ov demonstrates that a truly “one‑vision” foundation model is not only possible but also competitive, offering a streamlined path for developers to build next‑generation multimodal applications.

Authors

Haiwen Diao
Jiahao Wang
Penghao Wu
Yuhao Dong
Yuwei Niu
Yue Zhu
Zhongang Cai
Weichen Fan
Linjun Dai
Silei Wu
Xuanyu Zheng
Mingxuan Li
Yuanhan Zhang
Bo Li
Hanming Deng
Huchuan Lu
Quan Wang
Lei Yang
Lewei Lu
Dahua Lin
Ziwei Liu

Paper Information

arXiv ID: 2605.28820v1
Categories: cs.CV
Published: May 27, 2026
PDF: Download PDF

[Paper] From Pixels to Words -- Towards Native One-Vision Models at Scale

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

[Paper] KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems

[Paper] TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

[Paper] Vision-Language Models Suppress Female Representations Under Ambiguous Input