[Paper] Mull-Tokens: Modality-Agnostic Latent Thinking
Source: arXiv - 2512.10941v1
Overview
Mull‑Tokens introduces a modality‑agnostic latent reasoning layer that can hold intermediate “thoughts” drawn from both text and images. By training these tokens to act as a shared workspace, the model can switch freely between visual and linguistic information without relying on heavyweight specialist tools or handcrafted reasoning pipelines. The result is a more robust, scalable approach to multimodal reasoning that improves performance on spatially demanding benchmarks.
Key Contributions
- Modality‑agnostic latent tokens (Mull‑Tokens) that serve as a universal reasoning buffer for text and images.
- Two‑stage training recipe: (1) supervised pre‑training on interleaved text‑image reasoning traces, followed by (2) fine‑tuning with only final‑answer supervision, without intermediate traces.
- Empirical gains of +3 percentage points in average accuracy and up to +16 points on a puzzle‑solving split across four spatial reasoning datasets, outperforming strong text‑only and interleaved baselines.
- Practical recipe for integrating Mull‑Tokens into existing vision‑language architectures with minimal architectural changes.
Methodology
- Latent Token Design – A small set of learnable vectors (the Mull‑Tokens) is appended to the transformer’s token stream. These vectors are not tied to any specific modality; they can absorb visual embeddings, textual embeddings, or a mixture of both (see the first sketch after this list).
- Supervised Pre‑training – The model is fed reasoning traces: sequences that alternate between textual prompts and image patches, with intermediate supervision that tells the Mull‑Tokens what “thought” they should capture at each step.
- Answer‑Only Fine‑tuning – After the trace‑level supervision is removed, the model is trained only on the final answer (e.g., a multiple‑choice label). The Mull‑Tokens learn to self‑organize the necessary intermediate reasoning without explicit guidance (both stages are sketched after this list).
- Integration – Mull‑Tokens are inserted into standard vision‑language backbones (e.g., ViLT, CLIP‑based transformers). During inference, the model simply runs a forward pass; the tokens automatically mediate cross‑modal information flow.
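The snippet below is a minimal PyTorch sketch of this latent‑token design and integration, assuming a backbone that consumes pre‑fused (batch, tokens, dim) embeddings; the class and parameter names (`MullTokenWrapper`, `num_mull_tokens`) are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch (PyTorch assumed): learnable Mull-Tokens appended to the
# fused text/image token stream so self-attention can use them as a shared
# cross-modal workspace. Names are illustrative, not from the paper.
import torch
import torch.nn as nn

class MullTokenWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, num_mull_tokens: int = 8):
        super().__init__()
        self.backbone = backbone                          # transformer over (B, T, D) embeddings
        self.mull_tokens = nn.Parameter(                  # learnable latent "thought" slots
            torch.randn(num_mull_tokens, hidden_dim) * 0.02
        )

    def forward(self, text_embeds: torch.Tensor, image_embeds: torch.Tensor) -> torch.Tensor:
        # Concatenate text embeddings, image embeddings, and the Mull-Tokens,
        # then let the backbone attend across all of them in a single pass.
        batch = text_embeds.size(0)
        latents = self.mull_tokens.unsqueeze(0).expand(batch, -1, -1)
        tokens = torch.cat([text_embeds, image_embeds, latents], dim=1)
        return self.backbone(tokens)                      # (B, T_text + T_img + K, D)
```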
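The two training stages can likewise be summarized as two objectives. The sketch below is an assumption‑heavy illustration: an MSE loss between the Mull‑Token hidden states and reference “thought” embeddings stands in for the paper’s trace‑level supervision, while the second stage keeps only the cross‑entropy loss over the final answer.

```python
# Hedged sketch of the two-stage recipe. `mull_slice` is assumed to select the
# Mull-Token positions in the backbone output; the MSE trace loss is an
# illustrative stand-in for whatever intermediate supervision the paper uses.
import torch
import torch.nn.functional as F

def stage1_loss(hidden_states: torch.Tensor,     # (B, T, D) backbone output
                mull_slice: slice,                # positions of the Mull-Tokens
                trace_targets: torch.Tensor,      # (B, K, D) reference "thoughts"
                answer_logits: torch.Tensor,      # (B, num_classes)
                answer_labels: torch.Tensor) -> torch.Tensor:
    # Stage 1: supervise both the intermediate latent thoughts and the answer.
    trace_loss = F.mse_loss(hidden_states[:, mull_slice, :], trace_targets)
    answer_loss = F.cross_entropy(answer_logits, answer_labels)
    return answer_loss + trace_loss

def stage2_loss(answer_logits: torch.Tensor, answer_labels: torch.Tensor) -> torch.Tensor:
    # Stage 2: trace supervision removed; only the final answer (e.g., a
    # multiple-choice label) drives learning, so the tokens self-organize.
    return F.cross_entropy(answer_logits, answer_labels)
```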
Results & Findings
| Benchmark (Spatial Reasoning) | Baseline (Text‑Only) | Baseline (Interleaved) | Mull‑Tokens | Δ vs. Best Baseline |
|---|---|---|---|---|
| Puzzle‑Solve (Heavy) | 62 % | 68 % | 84 % | +16 % |
| 3D‑Perspective Shift | 71 % | 73 % | 76 % | +3 % |
| Object‑Relation Grid | 68 % | 70 % | 73 % | +3 % |
| Multi‑Step Navigation | 65 % | 66 % | 69 % | +3 % |
- Consistent improvement across all four datasets, confirming that a shared latent workspace helps the model synthesize visual and textual cues.
- Ablation studies show that removing the supervised trace stage drops performance by ~5 points, highlighting its role in shaping useful token dynamics.
- Token count analysis reveals diminishing returns beyond 8 Mull‑Tokens, suggesting a sweet spot between capacity and computational overhead.
Practical Implications
- Simplified pipelines – Developers can replace complex tool‑chaining (e.g., separate OCR, scene graph generators, and reasoning modules) with a single transformer augmented by Mull‑Tokens.
- Scalable to new domains – Because the tokens are modality‑agnostic, the same architecture can be fine‑tuned on robotics, AR/VR, or e‑commerce scenarios that require spatial or affordance reasoning.
- Lower inference cost – No need for external image generation or symbolic reasoning engines; the extra token embeddings add only a modest memory footprint.
- Plug‑and‑play – Existing vision‑language models can adopt Mull‑Tokens with a few lines of code (illustrated below), making the approach attractive for rapid prototyping in products that need “visual commonsense” (e.g., virtual assistants that understand layout instructions).
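To make the “few lines of code” claim concrete, here is a hypothetical usage of the `MullTokenWrapper` sketched in the Methodology section, with a stock PyTorch encoder standing in for a real vision‑language backbone; all shapes and dimensions are placeholders.

```python
# Hypothetical usage, reusing the MullTokenWrapper class from the earlier
# sketch. The TransformerEncoder is a stand-in for a vision-language backbone.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
model = MullTokenWrapper(backbone, hidden_dim=768, num_mull_tokens=8)

text_embeds = torch.randn(2, 32, 768)     # (batch, text tokens, dim)
image_embeds = torch.randn(2, 196, 768)   # (batch, image patches, dim)
hidden = model(text_embeds, image_embeds) # (2, 32 + 196 + 8, 768)
```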
Limitations & Future Work
- Domain specificity – The current training traces are curated for spatial puzzles; performance on abstract reasoning (e.g., causal inference) remains untested.
- Token capacity ceiling – While 8 tokens work well, scaling to more complex, multi‑step tasks may require hierarchical token structures or dynamic token allocation.
- Interpretability – The latent thoughts are not directly human‑readable; future work could explore probing or visualizing token activations to aid debugging.
- Cross‑modal pre‑training data – The approach still depends on high‑quality interleaved text‑image datasets; building larger, more diverse trace corpora could further boost generalization.
Bottom line: Mull‑Tokens offers a clean, scalable way to give multimodal models a shared “thinking space,” delivering measurable gains on challenging spatial reasoning tasks while keeping the engineering stack simple enough for real‑world deployment.
Authors
- Arijit Ray
- Ahmed Abdelkader
- Chengzhi Mao
- Bryan A. Plummer
- Kate Saenko
- Ranjay Krishna
- Leonidas Guibas
- Wen‑Sheng Chu
Paper Information
- arXiv ID: 2512.10941v1
- Categories: cs.CV, cs.AI
- Published: December 11, 2025