[Paper] Monet: Reasoning in Latent Visual Space Beyond Images and Language
Source: arXiv - 2511.21395v1
Overview
The paper presents Monet, a new training framework that lets multimodal large language models (MLLMs) “think” in a latent visual space instead of swapping back and forth between raw images and text. By generating continuous visual embeddings as intermediate reasoning steps, Monet moves visual reasoning closer to how humans form abstract visual thoughts, unlocking stronger performance on real‑world and out‑of‑distribution visual tasks.
Key Contributions
- Latent‑visual reasoning: Introduces a paradigm where MLLMs manipulate internal image embeddings directly, eliminating the need for external vision tools during chain‑of‑thought (CoT) generation.
- Three‑stage distillation SFT pipeline: A cost‑effective fine‑tuning recipe that aligns language and vision latent spaces while providing strong supervision for the generated embeddings.
- VLPO (Visual‑latent Policy Optimization): A policy‑gradient reinforcement‑learning method that explicitly incorporates the latent visual embeddings into the optimization signal, improving visual reasoning beyond what text‑only RL provides.
- Monet‑SFT‑125K dataset: A curated collection of 125K chain‑of‑thought examples covering real‑world photos, charts, OCR, and geometry problems, each interleaving text and latent‑visual steps (a hypothetical record sketch follows this list).
- Monet‑7B model: A 7‑billion‑parameter MLLM that consistently outperforms prior baselines on perception, reasoning, and abstract visual benchmarks, demonstrating strong generalization to unseen visual concepts.
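For intuition, an interleaved example in this style can be pictured as a sequence of text segments and latent‑visual placeholders. The record below is a hypothetical illustration of that structure, not the actual Monet‑SFT‑125K schema:

```python
# Hypothetical shape of one interleaved CoT record (illustrative only; the
# real Monet-SFT-125K format is not specified in this summary).
example = {
    "image": "chart_00042.png",
    "question": "Which quarter shows the largest revenue drop?",
    "cot": [
        {"type": "text",   "value": "Focus on the revenue bars for each quarter."},
        {"type": "latent", "value": "<continuous visual embedding, supervised by a teacher encoder>"},
        {"type": "text",   "value": "Q3 falls sharply relative to Q2."},
    ],
    "answer": "Q3",
}
```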
Methodology
- Latent Visual Space: Instead of feeding raw pixels to a vision encoder at every reasoning step, Monet’s language model predicts a continuous embedding vector that represents the “visual thought” it wants to use next (see the decoding sketch after this list). The embedding is decoded by a frozen vision decoder only when a final answer is needed.
- Three‑Stage Distillation SFT (a supervised‑loss sketch follows this list):
- Stage 1 – Vision‑Language Alignment: A teacher vision‑language model (e.g., CLIP) provides target embeddings for each image; the MLLM learns to mimic them.
- Stage 2 – Chain‑of‑Thought Supervision: Using the Monet‑SFT‑125K dataset, the model is fine‑tuned to produce alternating text and latent‑visual tokens that match human‑written CoTs.
- Stage 3 – Reinforcement Fine‑Tuning: VLPO applies policy‑gradient updates where the reward combines standard language correctness (e.g., answer accuracy) and a latent‑visual consistency term that measures how well the predicted embeddings align with the teacher’s latent space.
- VLPO vs. GRPO: The authors report that the commonly used Group Relative Policy Optimization (GRPO) improves only textual reasoning. VLPO adds a latent‑visual loss term to the policy gradient, directly encouraging the model to generate useful visual embeddings (a reward sketch follows this list).
Results & Findings
| Benchmark | Monet‑7B score (↑ better) | Gain over strong baseline |
|---|---|---|
| VQA‑Real (real‑world perception) | Accuracy 73.4% | +5.2 pts |
| ChartQA (chart reasoning) | Exact Match 68.1% | +6.8 pts |
| OCR‑CoT (text extraction + reasoning) | F1 81.7% | +4.5 pts |
| Abstract Geometry (out‑of‑distribution) | Solve Rate 62.3% | +9.1 pts |
Key takeaways
- Latent visual reasoning yields consistent gains across diverse tasks, especially where intermediate visual abstraction is crucial (charts, geometry).
- Ablation studies confirm that removing VLPO drops performance by ~3–4 % on visual‑heavy benchmarks, while discarding the distillation stages hurts alignment quality dramatically.
- The model maintains inference speed comparable to text‑only MLLMs because the heavy vision decoder runs at most once, at the end of generation, rather than at every reasoning step.
Practical Implications
- Developer‑friendly APIs: Monet can be wrapped as a single endpoint that accepts a prompt and an optional image and returns a textual answer, with no separate vision calls for each reasoning step (see the wrapper sketch after this list). This simplifies integration into chatbots, data‑analysis assistants, and low‑code platforms.
- Cost‑effective scaling: By keeping the vision encoder frozen and only generating lightweight embeddings, Monet reduces GPU memory and compute compared to full vision‑language pipelines, making it viable for on‑premise deployment or edge inference with a modest GPU.
- Enhanced UI/UX for visual assistants: Applications like document processing, dashboard analytics, or design review can now ask the model to “visualize” intermediate concepts (e.g., “draw the bounding box of the highlighted region”) without explicit image rendering, enabling richer, more natural interactions.
- Foundation for abstract reasoning: The latent‑visual approach opens doors for tasks that require mental imagery—such as planning robot motions from textual descriptions or reasoning about scientific diagrams—without hand‑crafting visual prompts.
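As an illustration of the single‑endpoint pattern mentioned in the first item above, the sketch below wraps a hypothetical Monet inference call behind one HTTP route using FastAPI. `monet_generate` and its signature are placeholders; Monet does not ship this API.

```python
# Minimal single-endpoint wrapper sketch (FastAPI); `monet_generate` stands in
# for whatever inference call a Monet deployment actually exposes.
import base64
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    prompt: str
    image_base64: Optional[str] = None  # optional image, sent as base64

def monet_generate(prompt: str, image_bytes: Optional[bytes]) -> str:
    raise NotImplementedError("plug in the actual Monet inference call here")

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    image_bytes = base64.b64decode(req.image_base64) if req.image_base64 else None
    # One call covers the full latent-visual reasoning loop internally;
    # no per-step vision-tool round trips are needed.
    answer = monet_generate(req.prompt, image_bytes)
    return {"answer": answer}
```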
Limitations & Future Work
- Dependence on frozen vision decoder: The quality of latent embeddings is bounded by the pre‑trained vision model; improvements may require joint training or better decoders.
- Dataset bias: Monet‑SFT‑125K, while diverse, still leans heavily on English‑centric sources and may not capture cultural visual conventions worldwide.
- Scalability to larger models: Experiments are limited to a 7B‑parameter backbone; it remains open how the approach scales to 30B+ models or to broader multimodal instruction tuning.
- Interpretability of latent thoughts: The embeddings are not directly human‑readable, making debugging of “visual mistakes” harder; future work could explore visualizing intermediate embeddings or mapping them to symbolic sketches.
Monet demonstrates that embedding visual reasoning directly into the language model’s latent space is not only feasible but also practically beneficial, paving the way for more compact and cognitively aligned multimodal AI systems.
Authors
- Qixun Wang
- Yang Shi
- Yifei Wang
- Yuanxing Zhang
- Pengfei Wan
- Kun Gai
- Xianghua Ying
- Yisen Wang
Paper Information
- arXiv ID: 2511.21395v1
- Categories: cs.CV, cs.AI
- Published: November 26, 2025