[Paper] Mull-Tokens: Modality-Agnostic Latent Thinking
Source: arXiv - 2512.10941v1
Overview
Mull‑Tokens introduces a modality‑agnostic latent reasoning layer that can hold intermediate “thoughts” drawn from both text and images. By training these tokens to act as a shared workspace, the model can switch freely between visual and linguistic information without relying on heavyweight specialist tools or handcrafted reasoning pipelines. The result is a more robust, scalable approach to multimodal reasoning that improves performance on spatially demanding benchmarks.
Key Contributions
- Modality‑agnostic latent tokens (Mull‑Tokens) that serve as a universal reasoning buffer for text and images.
- Two‑stage training recipe: (1) supervised pre‑training on interleaved text‑image reasoning traces, followed by (2) fine‑tuning with only final‑answer supervision, without intermediate traces.
- Empirical gains of +3 percentage points in average accuracy and up to +16 points on a puzzle‑solving split across four spatial reasoning datasets, outperforming strong text‑only and interleaved baselines.
- Practical recipe for integrating Mull‑Tokens into existing vision‑language architectures with minimal architectural changes.
Methodology
- Latent Token Design – A small set of learnable vectors (the Mull‑Tokens) is appended to the transformer’s token stream. These vectors are not tied to any specific modality; they can absorb visual embeddings, textual embeddings, or a mixture of both (see the first sketch after this list).
- Supervised Pre‑training – The model is fed reasoning traces: sequences that alternate between textual prompts and image patches, with intermediate supervision that tells the Mull‑Tokens what “thought” they should capture at each step.
- Answer‑Only Fine‑tuning – After the trace‑level supervision is removed, the model is trained only on the final answer (e.g., a multiple‑choice label). The Mull‑Tokens learn to self‑organize the necessary intermediate reasoning without explicit guidance (both stages are sketched after this list).
- Integration – Mull‑Tokens are inserted into standard vision‑language backbones (e.g., ViLT, CLIP‑based transformers). During inference, the model simply runs a forward pass; the tokens automatically mediate cross‑modal information flow.
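The snippet below is a minimal PyTorch sketch of this latent‑token design and integration, assuming a backbone that consumes pre‑fused (batch, tokens, dim) embeddings; the class and parameter names (`MullTokenWrapper`, `num_mull_tokens`) are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch (PyTorch assumed): learnable Mull-Tokens appended to the
# fused text/image token stream so self-attention can use them as a shared
# cross-modal workspace. Names are illustrative, not from the paper.
import torch
import torch.nn as nn

class MullTokenWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, num_mull_tokens: int = 8):
        super().__init__()
        self.backbone = backbone                          # transformer over (B, T, D) embeddings
        self.mull_tokens = nn.Parameter(                  # learnable latent "thought" slots
            torch.randn(num_mull_tokens, hidden_dim) * 0.02
        )

    def forward(self, text_embeds: torch.Tensor, image_embeds: torch.Tensor) -> torch.Tensor:
        # Concatenate text embeddings, image embeddings, and the Mull-Tokens,
        # then let the backbone attend across all of them in a single pass.
        batch = text_embeds.size(0)
        latents = self.mull_tokens.unsqueeze(0).expand(batch, -1, -1)
        tokens = torch.cat([text_embeds, image_embeds, latents], dim=1)
        return self.backbone(tokens)                      # (B, T_text + T_img + K, D)
```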
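The two training stages can likewise be summarized as two objectives. The sketch below is an assumption‑heavy illustration: an MSE loss between the Mull‑Token hidden states and reference “thought” embeddings stands in for the paper’s trace‑level supervision, while the second stage keeps only the cross‑entropy loss over the final answer.

```python
# Hedged sketch of the two-stage recipe. `mull_slice` is assumed to select the
# Mull-Token positions in the backbone output; the MSE trace loss is an
# illustrative stand-in for whatever intermediate supervision the paper uses.
import torch
import torch.nn.functional as F

def stage1_loss(hidden_states: torch.Tensor,     # (B, T, D) backbone output
                mull_slice: slice,                # positions of the Mull-Tokens
                trace_targets: torch.Tensor,      # (B, K, D) reference "thoughts"
                answer_logits: torch.Tensor,      # (B, num_classes)
                answer_labels: torch.Tensor) -> torch.Tensor:
    # Stage 1: supervise both the intermediate latent thoughts and the answer.
    trace_loss = F.mse_loss(hidden_states[:, mull_slice, :], trace_targets)
    answer_loss = F.cross_entropy(answer_logits, answer_labels)
    return answer_loss + trace_loss

def stage2_loss(answer_logits: torch.Tensor, answer_labels: torch.Tensor) -> torch.Tensor:
    # Stage 2: trace supervision removed; only the final answer (e.g., a
    # multiple-choice label) drives learning, so the tokens self-organize.
    return F.cross_entropy(answer_logits, answer_labels)
```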
Results & Findings
| Benchmark (Spatial Reasoning) | Baseline (Text‑Only) | Baseline (Interleaved) | Mull‑Tokens | Δ vs. Best Baseline |
|---|---|---|---|---|
| Puzzle‑Solve (Heavy) | 62 % | 68 % | 84 % | +16 % |
| 3D‑Perspective Shift | 71 % | 73 % | 76 % | +3 % |
| Object‑Relation Grid | 68 % | 70 % | 73 % | +3 % |
| Multi‑Step Navigation | 65 % | 66 % | 69 % | +3 % |
- Consistent improvement across all four datasets, confirming that a shared latent workspace helps the model synthesize visual and textual cues.
- Ablation studies show that removing the supervised trace stage drops performance by ~5 points, highlighting its role in shaping useful token dynamics.
- Token count analysis reveals diminishing returns beyond 8 Mull‑Tokens, suggesting a sweet spot between capacity and computational overhead.
Practical Implications
- Simplified pipelines – Developers can replace complex tool‑chaining (e.g., separate OCR, scene graph generators, and reasoning modules) with a single transformer augmented by Mull‑Tokens.
- Scalable to new domains – Because the tokens are modality‑agnostic, the same architecture can be fine‑tuned on robotics, AR/VR, or e‑commerce scenarios that require spatial or affordance reasoning.
- Lower inference cost – No need for external image generation or symbolic reasoning engines; the extra token embeddings add only a modest memory footprint.
- Plug‑and‑play – Existing vision‑language models can adopt Mull‑Tokens with a few lines of code (illustrated below), making the approach attractive for rapid prototyping in products that need “visual commonsense” (e.g., virtual assistants that understand layout instructions).
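To make the “few lines of code” claim concrete, here is a hypothetical usage of the `MullTokenWrapper` sketched in the Methodology section, with a stock PyTorch encoder standing in for a real vision‑language backbone; all shapes and dimensions are placeholders.

```python
# Hypothetical usage, reusing the MullTokenWrapper class from the earlier
# sketch. The TransformerEncoder is a stand-in for a vision-language backbone.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
model = MullTokenWrapper(backbone, hidden_dim=768, num_mull_tokens=8)

text_embeds = torch.randn(2, 32, 768)     # (batch, text tokens, dim)
image_embeds = torch.randn(2, 196, 768)   # (batch, image patches, dim)
hidden = model(text_embeds, image_embeds) # (2, 32 + 196 + 8, 768)
```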
Limitations & Future Work
- Domain specificity – The current training traces are curated for spatial puzzles; performance on abstract reasoning (e.g., causal inference) remains untested.
- Token capacity ceiling – While 8 tokens work well, scaling to more complex, multi‑step tasks may require hierarchical token structures or dynamic token allocation.
- Interpretability – The latent thoughts are not directly human‑readable; future work could explore probing or visualizing token activations to aid debugging.
- Cross‑modal pre‑training data – The approach still depends on high‑quality interleaved text‑image datasets; building larger, more diverse trace corpora could further boost generalization.
Bottom line: Mull‑Tokens offers a clean, scalable way to give multimodal models a shared “thinking space,” delivering measurable gains on challenging spatial reasoning tasks while keeping the engineering stack simple enough for real‑world deployment.
Authors
- Arijit Ray
- Ahmed Abdelkader
- Chengzhi Mao
- Bryan A. Plummer
- Kate Saenko
- Ranjay Krishna
- Leonidas Guibas
- Wen‑Sheng Chu
Paper Information
- arXiv ID: 2512.10941v1
- Categories: cs.CV, cs.AI
- Published: December 11, 2025