[Paper] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Source: arXiv - 2602.23306v1
Overview
The paper introduces ThinkOmni, a plug‑and‑play framework that lets existing omni‑modal large language models (OLLMs) inherit the sophisticated reasoning abilities of state‑of‑the‑art large reasoning models (LRMs) – without any extra training or data collection. By treating a powerful LRM as a “reasoning guide” during inference, ThinkOmni bridges the gap between perception‑heavy multimodal models and the deep logical chains required for tasks such as math, commonsense, and visual question answering.
Key Contributions
- Training‑free reasoning augmentation: Enables OLLMs to perform complex textual reasoning in multimodal contexts without fine‑tuning.
- LRM‑as‑a‑Guide: A novel inference‑time decoding strategy that consults an off‑the‑shelf LRM to steer the OLLM’s token generation.
- Stepwise Contrastive Scaling (SCS): An adaptive mechanism that automatically balances visual‑perceptual signals and textual‑reasoning cues, eliminating manual hyper‑parameter sweeps.
- Broad empirical validation: Consistent gains across six diverse multimodal reasoning benchmarks (e.g., MathVista, MMAU), achieving new state‑of‑the‑art scores (70.2 on MathVista, 75.5 on MMAU).
- General‑purpose recipe: Works with any compatible OLLM/LRM pair, making it a reusable “add‑on” for existing AI services.
Methodology
1. Dual-model setup
- Perceiver: An omni‑modal model (e.g., one pairing a CLIP‑style encoder with an LLM decoder, or a Flamingo‑style architecture) that ingests images, video frames, or other modalities and produces a textual context.
- Reasoner: A large language model specialized in chain‑of‑thought reasoning (e.g., GPT‑4, Claude).
2. Guidance Decoding
- During each generation step, the OLLM proposes a distribution over the next token.
- The LRM receives the same multimodal prompt (converted to text) and produces its own token distribution, reflecting pure reasoning.
- The two distributions are fused: the OLLM’s perception‑driven logits are scaled by a contrastive factor derived from the LRM’s logits, nudging the final output toward reasoning‑consistent tokens.
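The fusion step above can be sketched as follows. The summary does not give the paper's exact fusion rule, so this log‑space interpolation, with a hypothetical guidance weight `alpha`, is an illustrative assumption rather than the authors' formula:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def guided_logits(ollm_logits, lrm_logits, alpha=1.0):
    """Nudge the OLLM's next-token distribution toward the LRM's.

    alpha = 0 keeps the pure perception distribution; larger alpha
    leans further toward the reasoning model's preferences.
    """
    ollm = log_softmax(ollm_logits)
    lrm = log_softmax(lrm_logits)
    return [o + alpha * (r - o) for o, r in zip(ollm, lrm)]
```

With `alpha = 0` the OLLM decodes unmodified; with `alpha = 1` it follows the LRM outright; intermediate values blend perception and reasoning, which is the regime SCS (below) adapts per step.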
3. Stepwise Contrastive Scaling (SCS)
- Instead of a fixed weighting (e.g., 0.5 × perception + 0.5 × reasoning), SCS computes a dynamic scaling coefficient per decoding step based on the similarity between the two logits.
- When the LRM’s confidence is high, the scaling leans more heavily on reasoning; when the OLLM’s visual signal dominates, the system respects perception.
- This adaptive balance eliminates the need for exhaustive hyper‑parameter tuning across tasks.
4. Zero-training pipeline
- The framework requires only the pre‑trained OLLM and LRM; no additional datasets, fine‑tuning loops, or gradient updates are performed.
- Implementation is a lightweight wrapper around the standard generation API, making it easy to drop into existing inference services.
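The wrapper amounts to a standard greedy decoding loop with one extra forward pass per step. In the sketch below, `perceiver` and `reasoner` are hypothetical callables that map a token‑id list to next‑token logits (standing in for the two models' generation APIs), and a fixed `alpha` stands in for the SCS coefficient:

```python
def guided_generate(perceiver, reasoner, prompt, max_new_tokens=64,
                    alpha=0.5, eos_id=0):
    """Greedy decoding with reasoning guidance; no gradients or
    fine-tuning are involved, only two inference passes per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        p = perceiver(tokens)   # perception-driven next-token logits
        r = reasoner(tokens)    # reasoning-driven next-token logits
        fused = [a + alpha * (b - a) for a, b in zip(p, r)]
        nxt = max(range(len(fused)), key=fused.__getitem__)
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens[len(prompt):]
```

Because the loop only touches logits, it can wrap any pair of models exposing a per‑step scoring interface, which is what makes the recipe a drop‑in for existing inference services.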
Results & Findings
| Benchmark | Baseline OLLM | ThinkOmni (OLLM + LRM) | Δ Improvement |
|---|---|---|---|
| MathVista | 63.1 | 70.2 | +7.1 |
| MMAU | 68.4 | 75.5 | +7.1 |
| VQA‑Reason | 71.3 | 77.0 | +5.7 |
| ScienceQA‑MM | 66.8 | 73.2 | +6.4 |
| DocVQA‑Multi | 72.5 | 78.1 | +5.6 |
| Visual‑Commonsense | 69.0 | 74.8 | +5.8 |
- Consistent uplift across domains (math, science, commonsense, document understanding).
- Ablation studies show that removing SCS or the LRM guide drops performance back to baseline, confirming both components are essential.
- Latency impact is modest: inference time grows ~1.3× due to the extra LRM pass, which is acceptable for many real‑time applications when weighed against the accuracy gains.
Practical Implications
- Rapid capability boost: Companies can instantly enhance their multimodal products (e.g., visual assistants, educational bots) without costly retraining pipelines.
- Modular AI stacks: ThinkOmni encourages a “best‑of‑both‑worlds” architecture where perception and reasoning modules are developed independently and combined at inference time.
- Cost‑effective scaling: By reusing existing LLM APIs (e.g., OpenAI, Anthropic) as reasoning guides, developers avoid the massive GPU budgets typically required for multimodal fine‑tuning.
- Improved safety & interpretability: The LRM’s chain‑of‑thought outputs can be logged alongside the final answer, offering a transparent reasoning trace that can be audited or used for debugging.
- Edge‑to‑cloud hybrid deployments: The perception‑heavy OLLM can run on edge devices, while the LRM guide can be invoked in the cloud only when a complex reasoning step is detected, optimizing bandwidth and latency.
Limitations & Future Work
- Dependency on LRM quality: The framework’s ceiling is bounded by the reasoning model’s capabilities; a weak LRM will limit gains.
- Inference overhead: Running two large models in parallel doubles memory usage and increases latency, which may be prohibitive for low‑resource environments.
- Modal conversion bottleneck: Current implementation translates non‑text modalities to textual descriptions for the LRM, potentially losing fine‑grained visual cues.
- Future directions suggested by the authors include:
  - Exploring lightweight reasoning guides (e.g., distilled LRMs) to reduce compute.
  - Extending SCS to handle more than two modalities simultaneously.
  - Integrating feedback loops where the OLLM can request clarification from the LRM on ambiguous visual inputs.
Authors
- Yiran Guan
- Sifan Tu
- Dingkang Liang
- Linghao Zhu
- Jianzhong Ju
- Zhenbo Luo
- Jian Luan
- Yuliang Liu
- Xiang Bai
Paper Information
- arXiv ID: 2602.23306v1
- Categories: cs.CV
- Published: February 26, 2026