[Paper] From Pixels to Words -- Towards Native One-Vision Models at Scale

Published: May 27, 2026 at 01:59 PM EDT
Source: arXiv - 2605.28820v1

Overview

The paper presents NEO‑ov, a “native” vision‑language foundation model that learns to link pixels and words across multiple frames end‑to‑end, without relying on separate image encoders, language decoders, or post‑hoc fusion modules. By removing the traditional modular pipeline, the authors show that a single unified architecture can achieve competitive (and sometimes superior) performance on fine‑grained visual tasks, opening the door to more seamless multimodal AI at scale.

Key Contributions

  • One‑Vision Architecture: Introduces the first large‑scale native vision‑language model (NEO‑ov) that processes raw pixels and text jointly, eliminating external encoders and adapters.
  • Cross‑Frame Pixel‑Word Alignment: Enables fine‑grained correspondence between image patches and words across time, supporting video and multi‑image understanding.
  • Competitive Performance: Demonstrates that NEO‑ov narrows the accuracy gap with state‑of‑the‑art modular VLMs while excelling in tasks requiring detailed spatial reasoning.
  • Systematic Architectural Analysis: Provides ablation studies and design guidelines (e.g., tokenization strategy, attention scaling) that help researchers replicate or extend native multimodal models.
  • Open‑Source Release: Shares training recipes, code, and pretrained checkpoints, facilitating rapid adoption by the community.

Methodology

  1. Unified Transformer Backbone – A single transformer ingests a sequence composed of image patches (flattened pixel grids) and tokenized text. No separate CNN or vision transformer is pre‑trained; all parameters are learned jointly.
  2. Spatiotemporal Tokenization – For video or multi‑image inputs, each frame is split into non‑overlapping patches, and a lightweight positional encoding injects both spatial and temporal information.
  3. Cross‑Modal Attention – Standard multi‑head self‑attention operates over the combined token stream, allowing any word token to attend directly to any pixel patch, regardless of frame. This yields pixel‑word correspondence at every layer.
  4. Training Objective – A contrastive image‑text matching loss is combined with a masked language modeling objective, encouraging the model to predict missing words from visual context and vice versa.
  5. Scaling Strategy – The model is trained on large‑scale image‑text datasets (e.g., LAION) with mixed‑precision arithmetic and distributed data parallelism, following recipes that balance model depth, width, and batch size for stable convergence.
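
To make steps 1 to 3 concrete, here is a minimal NumPy sketch of the general idea: frames are cut into non‑overlapping patches, each patch receives a spatial and temporal position embedding, and a single self‑attention pass runs over the concatenated patch and text tokens. This is an illustration under stated assumptions, not the authors' implementation. The patch size, model width, random weights, token ids, and the additive frame/row/column position tables are all hypothetical choices made for the example.

```python
import numpy as np

def patchify(frames, patch):
    """Split frames of shape (T, H, W, C) into flattened non-overlapping patches.

    Returns (tokens, index): tokens has shape (T*gh*gw, patch*patch*C) and
    index holds the (frame, row, col) grid position of every patch.
    """
    T, H, W, C = frames.shape
    assert H % patch == 0 and W % patch == 0, "frame size must be divisible by patch"
    gh, gw = H // patch, W // patch
    x = frames.reshape(T, gh, patch, gw, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (T, gh, gw, patch, patch, C)
    tokens = x.reshape(T * gh * gw, patch * patch * C)
    t, r, c = np.meshgrid(np.arange(T), np.arange(gh), np.arange(gw), indexing="ij")
    index = np.stack([t.ravel(), r.ravel(), c.ravel()], axis=1)
    return tokens, index

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over one token stream."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 32                                      # hypothetical model width
frames = rng.random((2, 32, 32, 3))               # 2 RGB frames of 32x32 pixels
patches, index = patchify(frames, patch=16)       # 8 patches of 768 values each

# Hypothetical learned parameters, randomly initialised for the sketch.
w_patch = rng.normal(size=(patches.shape[1], d_model)) * 0.02
frame_pos = rng.normal(size=(2, d_model)) * 0.02
row_pos = rng.normal(size=(2, d_model)) * 0.02
col_pos = rng.normal(size=(2, d_model)) * 0.02
text_embed = rng.normal(size=(100, d_model)) * 0.02

# Patch embedding plus additive temporal and spatial position embeddings.
pos = frame_pos[index[:, 0]] + row_pos[index[:, 1]] + col_pos[index[:, 2]]
visual_tokens = patches @ w_patch + pos           # (8, d_model)
text_tokens = text_embed[np.array([5, 17, 42])]   # 3 hypothetical token ids

# One combined stream: every token can attend to every other token.
stream = np.concatenate([visual_tokens, text_tokens], axis=0)   # (11, d_model)
wq, wk, wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, weights = self_attention(stream, wq, wk, wv)

# Rows: text tokens, columns: patches from both frames.
text_to_patch = weights[len(visual_tokens):, :len(visual_tokens)]
print(text_to_patch.shape)
```

The `text_to_patch` slice is the part that corresponds to pixel‑word correspondence: each row shows how one word distributes its attention over patches from every frame. A real model would stack many such layers with multiple heads and trained weights.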

Results & Findings

Benchmark                                | Modular VLM (baseline) | NEO‑ov (native)
Image‑Text Retrieval (MSCOCO)            | 78.4 R@1               | 79.1 R@1
Video Question Answering (MSRVTT‑QA)     | 44.2 %                 | 45.6 %
Fine‑Grained Spatial Reasoning (RefCOCO) | 71.3 %                 | 73.0 %
Zero‑Shot Classification (ImageNet)      | 71.8 %                 | 71.5 %

  • Narrowed Gap: NEO‑ov slightly outperforms the modular baseline on three of the four benchmarks (by 0.7 to 1.7 points) and trails it by 0.3 points on ImageNet zero‑shot classification, despite its simpler pipeline.
  • Superior Spatial Intelligence: The model shows a clear edge on tasks that demand pixel‑level grounding (e.g., referring expression comprehension).
  • Scalability: Experiments scaling the model from 300 M to 2 B parameters show steady performance gains, suggesting that the native approach scales similarly to modular counterparts.
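
For readers who want to check the size of the gap, the short snippet below computes the per‑benchmark difference between the two columns of the table above. The numbers are copied from that table; the dictionary layout and variable names are just one convenient way to hold them.

```python
# (modular baseline, NEO-ov) scores copied from the results table.
results = {
    "Image-Text Retrieval (MSCOCO), R@1": (78.4, 79.1),
    "Video Question Answering (MSRVTT-QA), %": (44.2, 45.6),
    "Fine-Grained Spatial Reasoning (RefCOCO), %": (71.3, 73.0),
    "Zero-Shot Classification (ImageNet), %": (71.8, 71.5),
}

# Positive delta means the native model scores higher than the baseline.
deltas = {
    name: round(native - modular, 1)
    for name, (modular, native) in results.items()
}
native_wins = [name for name, delta in deltas.items() if delta > 0]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print(f"NEO-ov ahead on {len(native_wins)} of {len(deltas)} benchmarks")
```

The largest margin is on RefCOCO, which fits the claim that the native design helps most where pixel‑level grounding matters.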

Practical Implications

  • Simplified Deployment – A single model handling both vision and language simplifies serving and can reduce inference latency and memory overhead compared to pipelines that stitch together separate encoders and decoders.
  • Better Video Understanding – Direct cross‑frame attention makes NEO‑ov a strong candidate for applications like video captioning, surveillance analytics, or interactive media where temporal context matters.
  • Fine‑Grained UI/AR – Pixel‑level grounding can power more accurate visual assistants, AR overlays, and robotics perception systems that need to map language commands to precise image regions.
  • Unified Fine‑Tuning – Developers can fine‑tune the same checkpoint for diverse downstream tasks (retrieval, VQA, captioning) without re‑architecting the model, accelerating product iteration cycles.

Limitations & Future Work

  • Training Cost – End‑to‑end native training still requires massive compute (multi‑node GPUs) and large curated datasets, which may be prohibitive for smaller labs.
  • Generalization to High‑Resolution Inputs – Processing very high‑resolution images or long video sequences can exceed token limits; future work could explore hierarchical tokenization or memory‑efficient attention.
  • Interpretability – While the model learns pixel‑word alignments, the internal attention patterns are less transparent than explicit alignment modules; tools for visualizing cross‑modal attention would be valuable.
  • Domain Adaptation – Adapting NEO‑ov to specialized domains (medical imaging, satellite data) may require additional domain‑specific pre‑training or curriculum strategies.
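
The high‑resolution limitation follows from simple arithmetic: with non‑overlapping patches, the number of visual tokens grows with frame count and image area, and full self‑attention compares every token with every other. The sketch below works through a few cases. The patch size of 16 and the example resolutions are assumptions chosen for illustration, not values reported for NEO‑ov.

```python
def visual_token_count(frames, height, width, patch):
    """Number of non-overlapping patch tokens for a clip of equal-sized frames."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    return frames * (height // patch) * (width // patch)

def attention_pairs(n_tokens):
    """Token pairs scored by one full self-attention pass (quadratic in length)."""
    return n_tokens * n_tokens

PATCH = 16  # assumed patch size, for illustration only

single_image = visual_token_count(1, 224, 224, PATCH)     # 14 * 14 = 196
high_res = visual_token_count(1, 1024, 1024, PATCH)       # 64 * 64 = 4096
short_video = visual_token_count(32, 224, 224, PATCH)     # 32 * 196 = 6272

for label, n in [("224x224 image", single_image),
                 ("1024x1024 image", high_res),
                 ("32-frame 224x224 clip", short_video)]:
    print(f"{label}: {n} tokens, {attention_pairs(n)} attention pairs")
```

Under these assumptions, moving from a 224x224 image to a 32‑frame clip multiplies the token count by 32 and the attention pairs by 1024, which is why hierarchical tokenization or memory‑efficient attention is suggested as future work.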

NEO‑ov demonstrates that a truly “one‑vision” foundation model is not only possible but also competitive, offering a streamlined path for developers to build next‑generation multimodal applications.

Authors

  • Haiwen Diao
  • Jiahao Wang
  • Penghao Wu
  • Yuhao Dong
  • Yuwei Niu
  • Yue Zhu
  • Zhongang Cai
  • Weichen Fan
  • Linjun Dai
  • Silei Wu
  • Xuanyu Zheng
  • Mingxuan Li
  • Yuanhan Zhang
  • Bo Li
  • Hanming Deng
  • Huchuan Lu
  • Quan Wang
  • Lei Yang
  • Lewei Lu
  • Dahua Lin
  • Ziwei Liu

Paper Information

  • arXiv ID: 2605.28820v1
  • Categories: cs.CV
  • Published: May 27, 2026