[Paper] Monet: Reasoning in Latent Visual Space Beyond Images and Language
Source: arXiv - 2511.21395v1
Overview
The paper presents Monet, a new training framework that lets multimodal large language models (MLLMs) “think” in a latent visual space instead of swapping back and forth between raw images and text. By generating continuous visual embeddings as intermediate reasoning steps, Monet moves visual reasoning closer to how humans form abstract visual thoughts, unlocking stronger performance on real‑world and out‑of‑distribution visual tasks.
Key Contributions
- Latent‑visual reasoning: Introduces a paradigm where MLLMs manipulate internal image embeddings directly, eliminating the need for external vision tools during chain‑of‑thought (CoT) generation.
- Three‑stage distillation SFT pipeline: A cost‑effective fine‑tuning recipe that aligns language and vision latent spaces while providing strong supervision for the generated embeddings.
- VLPO (Visual‑latent Policy Optimization): A policy‑gradient reinforcement‑learning method that explicitly incorporates the latent visual embeddings into the optimization signal, improving visual reasoning beyond what text‑only RL provides.
- Monet‑SFT‑125K dataset: A curated collection of 125K chain‑of‑thought examples covering real‑world photos, charts, OCR, and geometry problems, each interleaving text and latent‑visual steps (a hypothetical record sketch follows this list).
- Monet‑7B model: A 7‑billion‑parameter MLLM that consistently outperforms prior baselines on perception, reasoning, and abstract visual benchmarks, demonstrating strong generalization to unseen visual concepts.
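For intuition, an interleaved example in this style can be pictured as a sequence of text segments and latent‑visual placeholders. The record below is a hypothetical illustration of that structure, not the actual Monet‑SFT‑125K schema:

```python
# Hypothetical shape of one interleaved CoT record (illustrative only; the
# real Monet-SFT-125K format is not specified in this summary).
example = {
    "image": "chart_00042.png",
    "question": "Which quarter shows the largest revenue drop?",
    "cot": [
        {"type": "text",   "value": "Focus on the revenue bars for each quarter."},
        {"type": "latent", "value": "<continuous visual embedding, supervised by a teacher encoder>"},
        {"type": "text",   "value": "Q3 falls sharply relative to Q2."},
    ],
    "answer": "Q3",
}
```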
Methodology
- Latent Visual Space: Instead of feeding raw pixels to a vision encoder at every reasoning step, Monet’s language model predicts a continuous embedding vector that represents the “visual thought” it wants to use next (see the decoding sketch after this list). The embedding is decoded by a frozen vision decoder only when a final answer is needed.
- Three‑Stage Distillation SFT (a supervised‑loss sketch follows this list):
- Stage 1 – Vision‑Language Alignment: A teacher vision‑language model (e.g., CLIP) provides target embeddings for each image; the MLLM learns to mimic them.
- Stage 2 – Chain‑of‑Thought Supervision: Using the Monet‑SFT‑125K dataset, the model is fine‑tuned to produce alternating text and latent‑visual tokens that match human‑written CoTs.
- Stage 3 – Reinforcement Fine‑Tuning: VLPO applies policy‑gradient updates where the reward combines standard language correctness (e.g., answer accuracy) and a latent‑visual consistency term that measures how well the predicted embeddings align with the teacher’s latent space.
- VLPO vs. GRPO: The authors report that the commonly used Group Relative Policy Optimization (GRPO) improves only textual reasoning. VLPO adds a latent‑visual loss term to the policy gradient, directly encouraging the model to generate useful visual embeddings (a reward sketch follows this list).
Results & Findings
| Benchmark | Monet‑7B score (↑ better) | Gain over strong baseline |
|---|---|---|
| VQA‑Real (real‑world perception) | Accuracy 73.4% | +5.2 pts |
| ChartQA (chart reasoning) | Exact Match 68.1% | +6.8 pts |
| OCR‑CoT (text extraction + reasoning) | F1 81.7% | +4.5 pts |
| Abstract Geometry (out‑of‑distribution) | Solve Rate 62.3% | +9.1 pts |
Key takeaways
- Latent visual reasoning yields consistent gains across diverse tasks, especially where intermediate visual abstraction is crucial (charts, geometry).
- Ablation studies confirm that removing VLPO drops performance by ~3–4 % on visual‑heavy benchmarks, while discarding the distillation stages hurts alignment quality dramatically.
- The model maintains inference speed comparable to text‑only MLLMs because the heavy vision decoder runs at most once, at the end of generation, rather than at every reasoning step.
Practical Implications
- Developer‑friendly APIs: Monet can be wrapped as a single endpoint that accepts a prompt and an optional image and returns a textual answer, with no separate vision calls for each reasoning step (see the wrapper sketch after this list). This simplifies integration into chatbots, data‑analysis assistants, and low‑code platforms.
- Cost‑effective scaling: By keeping the vision encoder frozen and only generating lightweight embeddings, Monet reduces GPU memory and compute compared to full vision‑language pipelines, making it viable for on‑premise deployment or edge inference with a modest GPU.
- Enhanced UI/UX for visual assistants: Applications like document processing, dashboard analytics, or design review can now ask the model to “visualize” intermediate concepts (e.g., “draw the bounding box of the highlighted region”) without explicit image rendering, enabling richer, more natural interactions.
- Foundation for abstract reasoning: The latent‑visual approach opens doors for tasks that require mental imagery—such as planning robot motions from textual descriptions or reasoning about scientific diagrams—without hand‑crafting visual prompts.
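As an illustration of the single‑endpoint pattern mentioned in the first item above, the sketch below wraps a hypothetical Monet inference call behind one HTTP route using FastAPI. `monet_generate` and its signature are placeholders; Monet does not ship this API.

```python
# Minimal single-endpoint wrapper sketch (FastAPI); `monet_generate` stands in
# for whatever inference call a Monet deployment actually exposes.
import base64
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    prompt: str
    image_base64: Optional[str] = None  # optional image, sent as base64

def monet_generate(prompt: str, image_bytes: Optional[bytes]) -> str:
    raise NotImplementedError("plug in the actual Monet inference call here")

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    image_bytes = base64.b64decode(req.image_base64) if req.image_base64 else None
    # One call covers the full latent-visual reasoning loop internally;
    # no per-step vision-tool round trips are needed.
    answer = monet_generate(req.prompt, image_bytes)
    return {"answer": answer}
```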
Limitations & Future Work
- Dependence on frozen vision decoder: The quality of latent embeddings is bounded by the pre‑trained vision model; improvements may require joint training or better decoders.
- Dataset bias: Monet‑SFT‑125K, while diverse, still leans heavily on English‑centric sources and may not capture cultural visual conventions worldwide.
- Scalability to larger models: Experiments are limited to a 7B‑parameter backbone; it remains open how the approach scales to 30B+ models or to broader multimodal instruction tuning.
- Interpretability of latent thoughts: The embeddings are not directly human‑readable, making debugging of “visual mistakes” harder; future work could explore visualizing intermediate embeddings or mapping them to symbolic sketches.
Monet demonstrates that embedding visual reasoning directly into the language model’s latent space is not only feasible but also practically beneficial, paving the way for more compact and cognitively aligned multimodal AI systems.
Authors
- Qixun Wang
- Yang Shi
- Yifei Wang
- Yuanxing Zhang
- Pengfei Wan
- Kun Gai
- Xianghua Ying
- Yisen Wang
Paper Information
- arXiv ID: 2511.21395v1
- Categories: cs.CV, cs.AI
- Published: November 26, 2025