[Paper] Mull-Tokens: Modality-Agnostic Latent Thinking

Published: December 11, 2025 at 01:59 PM EST
3 min read
Source: arXiv - 2512.10941v1

Overview

Mull‑Tokens introduces a modality‑agnostic latent reasoning layer that can hold intermediate “thoughts” from both text and images. By training these tokens to act as a shared workspace, the model can freely switch between visual and linguistic information without relying on heavyweight specialist tools or handcrafted reasoning pipelines. The result is a more robust, scalable approach to multimodal reasoning that pushes performance on spatial‑heavy benchmarks.

Key Contributions

  • Modality‑agnostic latent tokens (Mull‑Tokens) that serve as a universal reasoning buffer for text and images.
  • Two‑stage training recipe: (1) supervised pre‑training on interleaved text‑image reasoning traces, followed by (2) fine‑tuning with only final‑answer supervision and no intermediate traces (a sketch of both objectives follows this list).
  • Empirical gains of +3 % average accuracy and up to +16 % on a puzzle‑solving split across four spatial reasoning datasets, outperforming strong text‑only and interleaved baselines.
  • Practical recipe for integrating Mull‑Tokens into existing vision‑language architectures with minimal architectural changes.
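
A compact way to read the two‑stage recipe, written as a sketch with assumed cross‑entropy (CE) objectives rather than the paper's exact losses: let x be the interleaved text‑image input, z_1, …, z_T the Mull‑Token states, s_t the supervised intermediate "thought" at step t, and y the final answer.

$$
\mathcal{L}_{\text{stage 1}} \;=\; \sum_{t=1}^{T} \mathrm{CE}\big(g_\theta(z_t),\, s_t\big) \;+\; \mathrm{CE}\big(f_\theta(x, z_{1:T}),\, y\big),
\qquad
\mathcal{L}_{\text{stage 2}} \;=\; \mathrm{CE}\big(f_\theta(x, z_{1:T}),\, y\big)
$$

Here g_θ decodes a latent state into a predicted thought and f_θ maps the input plus latents to an answer. Stage 2 drops the trace term, so the latents must self‑organize around whatever intermediate computation the final answer requires.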

Methodology

  1. Latent Token Design – A small set of learnable vectors (the Mull‑Tokens) is appended to the transformer’s token stream (see the code sketch after this list). These vectors are not tied to any specific modality; they can absorb visual embeddings, textual embeddings, or a mixture of both.
  2. Supervised Pre‑training – The model is fed reasoning traces: sequences that alternate between textual prompts and image patches, with intermediate supervision that tells the Mull‑Tokens what “thought” they should capture at each step.
  3. Answer‑Only Fine‑tuning – Once the trace‑level supervision is removed, the model is trained only on the final answer (e.g., a multiple‑choice label). The Mull‑Tokens learn to self‑organize the necessary intermediate reasoning without explicit guidance.
  4. Integration – Mull‑Tokens are inserted into standard vision‑language backbones (e.g., ViLT, CLIP‑based transformers). During inference, the model simply runs a forward pass; the tokens automatically mediate cross‑modal information flow.
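
To make the mechanics concrete, here is a minimal PyTorch‑style sketch under our own assumptions: the module name MullTokenLayer, the token count, and the toy encoder are illustrative stand‑ins, not the authors' reference implementation. A small bank of learnable vectors is concatenated onto the fused text‑image token stream, and ordinary self‑attention lets those slots read from and write to both modalities.

```python
# A minimal sketch assuming a PyTorch-style backbone; all names and shapes
# below are illustrative, not the authors' reference code.
import torch
import torch.nn as nn


class MullTokenLayer(nn.Module):
    """Appends a small bank of learnable, modality-agnostic latent tokens."""

    def __init__(self, d_model: int, num_latent_tokens: int = 8):
        super().__init__()
        # Learnable vectors shared across modalities; 8 tokens matches the
        # sweet spot reported in the results section.
        self.latents = nn.Parameter(torch.randn(num_latent_tokens, d_model) * 0.02)

    def forward(self, token_stream: torch.Tensor) -> torch.Tensor:
        # token_stream: (batch, seq_len, d_model) interleaved text/image embeddings.
        batch = token_stream.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate the latent "thinking" slots; self-attention inside the
        # backbone lets them read from and write to both modalities.
        return torch.cat([token_stream, latents], dim=1)


if __name__ == "__main__":
    d_model, num_latents = 256, 8
    # Generic transformer encoder standing in for a vision-language backbone.
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
        num_layers=2,
    )
    mull = MullTokenLayer(d_model, num_latents)

    fused_tokens = torch.randn(4, 77, d_model)      # e.g. text + image patch embeddings
    hidden = backbone(mull(fused_tokens))           # (4, 77 + 8, d_model)
    thought_states = hidden[:, -num_latents:, :]    # read out the latent "thoughts"
    print(thought_states.shape)                     # torch.Size([4, 8, 256])
```

In this toy setup the latent slots sit at the end of the sequence; reading them back out after the forward pass is how an answer head would consume the intermediate "thoughts" (see the plug‑and‑play sketch under Practical Implications).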

Results & Findings

Benchmark (Spatial Reasoning)   Baseline (Text‑Only)   Baseline (Interleaved)   Mull‑Tokens   Δ vs. Best Baseline
Puzzle‑Solve (Heavy)            62 %                   68 %                     84 %          +16 %
3D‑Perspective Shift            71 %                   73 %                     76 %          +3 %
Object‑Relation Grid            68 %                   70 %                     73 %          +3 %
Multi‑Step Navigation           65 %                   66 %                     69 %          +3 %
  • Consistent improvement across all four datasets, confirming that a shared latent workspace helps the model synthesize visual and textual cues.
  • Ablation studies show that removing the supervised trace stage drops performance by ~5 %, highlighting its role in shaping useful token dynamics.
  • Token count analysis reveals diminishing returns beyond 8 Mull‑Tokens, suggesting a sweet spot between capacity and computational overhead.

Practical Implications

  • Simplified pipelines – Developers can replace complex tool‑chaining (e.g., separate OCR, scene graph generators, and reasoning modules) with a single transformer augmented by Mull‑Tokens.
  • Scalable to new domains – Because the tokens are modality‑agnostic, the same architecture can be fine‑tuned on robotics, AR/VR, or e‑commerce scenarios that require spatial or affordance reasoning.
  • Lower inference cost – No need for external image generation or symbolic reasoning engines; the extra token embeddings add only a modest memory footprint.
  • Plug‑and‑play – Existing vision‑language models can adopt Mull‑Tokens with a few lines of code (see the integration sketch below), making it attractive for rapid prototyping in products that need “visual commonsense” (e.g., virtual assistants that understand layout instructions).
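
As a rough illustration of the plug‑and‑play claim, the sketch below wraps a generic backbone with the MullTokenLayer from the Methodology sketch and adds an answer head trained on final‑answer labels only. All names here are hypothetical stand‑ins, not the authors' API.

```python
# Hypothetical plug-in sketch: an existing vision-language backbone is reused
# unchanged; only a latent-token layer and an answer head are added on top.
# MullTokenLayer is the class defined in the earlier Methodology sketch.
import torch
import torch.nn as nn


class MullAugmentedVLM(nn.Module):
    def __init__(self, backbone: nn.Module, d_model: int, num_answers: int,
                 num_latents: int = 8):
        super().__init__()
        self.backbone = backbone                          # existing VLM transformer, unchanged
        self.mull = MullTokenLayer(d_model, num_latents)  # from the Methodology sketch
        self.answer_head = nn.Linear(d_model, num_answers)
        self.num_latents = num_latents

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        # fused_tokens: (batch, seq_len, d_model) interleaved text/image embeddings.
        hidden = self.backbone(self.mull(fused_tokens))
        # Pool the latent "thought" slots and classify; answer-only supervision
        # is all the second training stage needs in this sketch.
        pooled = hidden[:, -self.num_latents:, :].mean(dim=1)
        return self.answer_head(pooled)
```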

Limitations & Future Work

  • Domain specificity – The current training traces are curated for spatial puzzles; performance on abstract reasoning (e.g., causal inference) remains untested.
  • Token capacity ceiling – While 8 tokens work well, scaling to more complex, multi‑step tasks may require hierarchical token structures or dynamic token allocation.
  • Interpretability – The latent thoughts are not directly human‑readable; future work could explore probing or visualizing token activations to aid debugging.
  • Cross‑modal pre‑training data – The approach still depends on high‑quality interleaved text‑image datasets; building larger, more diverse trace corpora could further boost generalization.

Bottom line: Mull‑Tokens offers a clean, scalable way to give multimodal models a shared “thinking space,” delivering measurable gains on challenging spatial reasoning tasks while keeping the engineering stack simple enough for real‑world deployment.

Authors

  • Arijit Ray
  • Ahmed Abdelkader
  • Chengzhi Mao
  • Bryan A. Plummer
  • Kate Saenko
  • Ranjay Krishna
  • Leonidas Guibas
  • Wen‑Sheng Chu

Paper Information

  • arXiv ID: 2512.10941v1
  • Categories: cs.CV, cs.AI
  • Published: December 11, 2025