[Paper] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Source: arXiv - 2602.23306v1
Overview
The paper introduces ThinkOmni, a plug‑and‑play framework that lets existing omni‑modal large language models (OLLMs) inherit the sophisticated reasoning abilities of state‑of‑the‑art large reasoning models (LRMs) – without any extra training or data collection. By treating a powerful LRM as a “reasoning guide” during inference, ThinkOmni bridges the gap between perception‑heavy multimodal models and the deep logical chains required for tasks such as math, commonsense, and visual question answering.
Key Contributions
- Training‑free reasoning augmentation: Enables OLLMs to perform complex textual reasoning in multimodal contexts without fine‑tuning.
- LRM‑as‑a‑Guide: A novel inference‑time decoding strategy that consults an off‑the‑shelf LRM to steer the OLLM’s token generation.
- Stepwise Contrastive Scaling (SCS): An adaptive mechanism that automatically balances visual‑perceptual signals and textual‑reasoning cues, eliminating manual hyper‑parameter sweeps.
- Broad empirical validation: Consistent gains across six diverse multimodal reasoning benchmarks (e.g., MathVista, MMAU), achieving new state‑of‑the‑art scores (70.2 on MathVista, 75.5 on MMAU).
- General‑purpose recipe: Works with any compatible OLLM/LRM pair, making it a reusable “add‑on” for existing AI services.
Methodology
1. Dual-model setup
- Perceiver: An omni‑modal model (e.g., one pairing a CLIP‑style encoder with an LLM decoder, or a Flamingo‑style architecture) that ingests images, video frames, or other modalities and produces a textual context.
- Reasoner: A large language model specialized in chain‑of‑thought reasoning (e.g., GPT‑4, Claude).
2. Guidance Decoding
- During each generation step, the OLLM proposes a distribution over the next token.
- The LRM receives the same multimodal prompt (converted to text) and produces its own token distribution, reflecting pure reasoning.
- The two distributions are fused: the OLLM’s perception‑driven logits are scaled by a contrastive factor derived from the LRM’s logits, nudging the final output toward reasoning‑consistent tokens.
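The fusion step above can be sketched as follows. The summary does not give the paper's exact fusion rule, so this log‑space interpolation, with a hypothetical guidance weight `alpha`, is an illustrative assumption rather than the authors' formula:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def guided_logits(ollm_logits, lrm_logits, alpha=1.0):
    """Nudge the OLLM's next-token distribution toward the LRM's.

    alpha = 0 keeps the pure perception distribution; larger alpha
    leans further toward the reasoning model's preferences.
    """
    ollm = log_softmax(ollm_logits)
    lrm = log_softmax(lrm_logits)
    return [o + alpha * (r - o) for o, r in zip(ollm, lrm)]
```

With `alpha = 0` the OLLM decodes unmodified; with `alpha = 1` it follows the LRM outright; intermediate values blend perception and reasoning, which is the regime SCS (below) adapts per step.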
3. Stepwise Contrastive Scaling (SCS)
- Instead of a fixed weighting (e.g., 0.5 × perception + 0.5 × reasoning), SCS computes a dynamic scaling coefficient per decoding step based on the similarity between the two logits.
- When the LRM’s confidence is high, the scaling leans more heavily on reasoning; when the OLLM’s visual signal dominates, the system respects perception.
- This adaptive balance eliminates the need for exhaustive hyper‑parameter tuning across tasks.
4. Zero-training pipeline
- The framework requires only the pre‑trained OLLM and LRM; no additional datasets, fine‑tuning loops, or gradient updates are performed.
- Implementation is a lightweight wrapper around the standard generation API, making it easy to drop into existing inference services.
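The wrapper amounts to a standard greedy decoding loop with one extra forward pass per step. In the sketch below, `perceiver` and `reasoner` are hypothetical callables that map a token‑id list to next‑token logits (standing in for the two models' generation APIs), and a fixed `alpha` stands in for the SCS coefficient:

```python
def guided_generate(perceiver, reasoner, prompt, max_new_tokens=64,
                    alpha=0.5, eos_id=0):
    """Greedy decoding with reasoning guidance; no gradients or
    fine-tuning are involved, only two inference passes per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        p = perceiver(tokens)   # perception-driven next-token logits
        r = reasoner(tokens)    # reasoning-driven next-token logits
        fused = [a + alpha * (b - a) for a, b in zip(p, r)]
        nxt = max(range(len(fused)), key=fused.__getitem__)
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens[len(prompt):]
```

Because the loop only touches logits, it can wrap any pair of models exposing a per‑step scoring interface, which is what makes the recipe a drop‑in for existing inference services.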
Results & Findings
| Benchmark | Baseline OLLM | ThinkOmni (OLLM + LRM) | Δ Improvement |
|---|---|---|---|
| MathVista | 63.1 | 70.2 | +7.1 |
| MMAU | 68.4 | 75.5 | +7.1 |
| VQA‑Reason | 71.3 | 77.0 | +5.7 |
| ScienceQA‑MM | 66.8 | 73.2 | +6.4 |
| DocVQA‑Multi | 72.5 | 78.1 | +5.6 |
| Visual‑Commonsense | 69.0 | 74.8 | +5.8 |
- Consistent uplift across domains (math, science, commonsense, document understanding).
- Ablation studies show that removing SCS or the LRM guide drops performance back to baseline, confirming both components are essential.
- Latency impact is modest: inference time grows ~1.3× due to the extra LRM pass, which is acceptable for many real‑time applications when weighed against the accuracy gains.
Practical Implications
- Rapid capability boost: Companies can instantly enhance their multimodal products (e.g., visual assistants, educational bots) without costly retraining pipelines.
- Modular AI stacks: ThinkOmni encourages a “best‑of‑both‑worlds” architecture where perception and reasoning modules are developed independently and combined at inference time.
- Cost‑effective scaling: By reusing existing LLM APIs (e.g., OpenAI, Anthropic) as reasoning guides, developers avoid the massive GPU budgets typically required for multimodal fine‑tuning.
- Improved safety & interpretability: The LRM’s chain‑of‑thought outputs can be logged alongside the final answer, offering a transparent reasoning trace that can be audited or used for debugging.
- Edge‑to‑cloud hybrid deployments: The perception‑heavy OLLM can run on edge devices, while the LRM guide can be invoked in the cloud only when a complex reasoning step is detected, optimizing bandwidth and latency.
Limitations & Future Work
- Dependency on LRM quality: The framework’s ceiling is bounded by the reasoning model’s capabilities; a weak LRM will limit gains.
- Inference overhead: Running two large models in parallel doubles memory usage and increases latency, which may be prohibitive for low‑resource environments.
- Modal conversion bottleneck: Current implementation translates non‑text modalities to textual descriptions for the LRM, potentially losing fine‑grained visual cues.
- Future directions suggested by the authors include:
  - Exploring lightweight reasoning guides (e.g., distilled LRMs) to reduce compute.
  - Extending SCS to handle more than two modalities simultaneously.
  - Integrating feedback loops where the OLLM can request clarification from the LRM on ambiguous visual inputs.
Authors
- Yiran Guan
- Sifan Tu
- Dingkang Liang
- Linghao Zhu
- Jianzhong Ju
- Zhenbo Luo
- Jian Luan
- Yuliang Liu
- Xiang Bai
Paper Information
- arXiv ID: 2602.23306v1
- Categories: cs.CV
- Published: February 26, 2026