[Paper] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

Published: February 26, 2026 at 01:10 PM EST
5 min read
Source: arXiv - 2602.23306v1

Overview

The paper introduces ThinkOmni, a plug‑and‑play framework that lets existing omni‑modal large language models (OLLMs) inherit the sophisticated reasoning abilities of state‑of‑the‑art large reasoning models (LRMs) – without any extra training or data collection. By treating a powerful LRM as a “reasoning guide” during inference, ThinkOmni bridges the gap between perception‑heavy multimodal models and the deep logical chains required for tasks such as math, commonsense, and visual question answering.

Key Contributions

  • Training‑free reasoning augmentation: Enables OLLMs to perform complex textual reasoning in multimodal contexts without fine‑tuning.
  • LRM‑as‑a‑Guide: A novel inference‑time decoding strategy that consults an off‑the‑shelf LRM to steer the OLLM’s token generation.
  • Stepwise Contrastive Scaling (SCS): An adaptive mechanism that automatically balances visual‑perceptual signals and textual‑reasoning cues, eliminating manual hyper‑parameter sweeps.
  • Broad empirical validation: Consistent gains across six diverse multimodal reasoning benchmarks (e.g., MathVista, MMAU), achieving new state‑of‑the‑art scores (70.2 on MathVista, 75.5 on MMAU).
  • General‑purpose recipe: Works with any compatible OLLM/LRM pair, making it a reusable “add‑on” for existing AI services.

Methodology

  1. Dual‑model setup

    • Perceiver: An omni‑modal LLM that ingests images, audio, video frames, or other modalities alongside the text prompt and proposes perception‑grounded next‑token distributions.
    • Reasoner: An off‑the‑shelf large reasoning model specialized in chain‑of‑thought reasoning (e.g., GPT‑4, Claude).
  2. Guidance Decoding

    • During each generation step, the OLLM proposes a distribution over the next token.
    • The LRM receives the same multimodal prompt (converted to text) and produces its own token distribution, reflecting pure reasoning.
    • The two distributions are fused: the OLLM’s perception‑driven logits are scaled by a contrastive factor derived from the LRM’s logits, nudging the final output toward reasoning‑consistent tokens.
  3. Stepwise Contrastive Scaling (SCS)

    • Instead of a fixed weighting (e.g., 0.5 × perception + 0.5 × reasoning), SCS computes a dynamic scaling coefficient per decoding step based on the similarity between the two logits.
    • When the LRM’s confidence is high, the scaling leans more heavily on reasoning; when the OLLM’s visual signal dominates, the system respects perception.
    • This adaptive balance eliminates the need for exhaustive hyper‑parameter tuning across tasks.
  4. Zero‑training pipeline

    • The framework requires only the pre‑trained OLLM and LRM; no additional datasets, fine‑tuning loops, or gradient updates are performed.
    • Implementation is a lightweight wrapper around the standard generation API, making it easy to drop into existing inference services.
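Steps 2 and 3 above can be sketched in a few lines. Note that the summary does not give the paper's exact formulas, so the choices below are illustrative stand‑ins: cosine similarity between the two token distributions as the "agreement" signal, the reasoner's max probability as its confidence, and linear interpolation of logits as the fusion rule. Function names (`scs_coefficient`, `guided_step`) are likewise hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def scs_coefficient(ollm_logits, lrm_logits):
    """Stepwise Contrastive Scaling: a per-step weight in [0, 1].

    Illustrative rule: lean on the reasoner when it is confident AND the two
    models disagree; when the distributions already agree, little steering
    is needed, so the coefficient shrinks toward 0.
    """
    p = softmax(ollm_logits)
    q = softmax(lrm_logits)
    lrm_conf = max(q)  # reasoner's confidence in its top token
    dot = sum(a * b for a, b in zip(p, q))
    agreement = dot / (
        math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    )
    return lrm_conf * (1.0 - agreement)

def guided_step(ollm_logits, lrm_logits):
    """One decoding step: fuse perception-driven and reasoning-driven logits.

    Returns the chosen token index and the scaling coefficient used.
    """
    alpha = scs_coefficient(ollm_logits, lrm_logits)
    fused = [o + alpha * (r - o) for o, r in zip(ollm_logits, lrm_logits)]
    return max(range(len(fused)), key=fused.__getitem__), alpha

# Toy example over a 3-token vocabulary: the perceiver prefers token 0,
# the confident reasoner prefers token 2, and the fused logits follow it.
token, alpha = guided_step([2.0, 1.0, 0.0], [0.0, 1.0, 3.0])
```

A fixed weighting would correspond to a constant `alpha` (e.g., 0.5) at every step; the point of SCS is that `alpha` adapts per step, which is what removes the per‑task hyper‑parameter sweep. Wrapping `guided_step` around a standard `generate` loop is what makes the pipeline training‑free.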

Results & Findings

Benchmark             Baseline OLLM   ThinkOmni (OLLM + LRM)   Δ Improvement
MathVista             63.1            70.2                     +7.1
MMAU                  68.4            75.5                     +7.1
VQA‑Reason            71.3            77.0                     +5.7
ScienceQA‑MM          66.8            73.2                     +6.4
DocVQA‑Multi          72.5            78.1                     +5.6
Visual‑Commonsense    69.0            74.8                     +5.8
  • Consistent uplift across domains (math, science, commonsense, document understanding).
  • Ablation studies show that removing SCS or the LRM guide drops performance back to baseline, confirming both components are essential.
  • Latency impact is modest: inference time grows ~1.3× due to the extra LRM pass, which is acceptable for many real‑time applications when weighed against the accuracy gains.

Practical Implications

  • Rapid capability boost: Companies can instantly enhance their multimodal products (e.g., visual assistants, educational bots) without costly retraining pipelines.
  • Modular AI stacks: ThinkOmni encourages a “best‑of‑both‑worlds” architecture where perception and reasoning modules are developed independently and combined at inference time.
  • Cost‑effective scaling: By reusing existing LLM APIs (e.g., OpenAI, Anthropic) as reasoning guides, developers avoid the massive GPU budgets typically required for multimodal fine‑tuning.
  • Improved safety & interpretability: The LRM’s chain‑of‑thought outputs can be logged alongside the final answer, offering a transparent reasoning trace that can be audited or used for debugging.
  • Edge‑to‑cloud hybrid deployments: The perception‑heavy OLLM can run on edge devices, while the LRM guide can be invoked in the cloud only when a complex reasoning step is detected, optimizing bandwidth and latency.
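The last deployment pattern needs a trigger that decides when a step is "complex" enough to justify a cloud call. The summary does not specify the detector, so the entropy threshold below is an illustrative assumption: a peaked local distribution is treated as an easy perceptual step, while a flat, uncertain one escalates to the reasoning guide.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_cloud_guide(ollm_probs, threshold=1.0):
    """Escalate to the cloud LRM only when the edge model is uncertain.

    High entropy in the local next-token distribution serves as a proxy
    for 'complex reasoning step'; the threshold is a deployment knob.
    """
    return entropy(ollm_probs) > threshold

# A confident perceptual step stays on-device; an uncertain one escalates.
on_device = not needs_cloud_guide([0.9, 0.05, 0.05])   # low entropy
escalate = needs_cloud_guide([0.34, 0.33, 0.33])       # near-uniform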

Limitations & Future Work

  • Dependency on LRM quality: The framework’s ceiling is bounded by the reasoning model’s capabilities; a weak LRM will limit gains.
  • Inference overhead: Running two large models in parallel doubles memory usage and increases latency, which may be prohibitive for low‑resource environments.
  • Modal conversion bottleneck: Current implementation translates non‑text modalities to textual descriptions for the LRM, potentially losing fine‑grained visual cues.
  • Future directions suggested by the authors include:
    1. Exploring lightweight reasoning guides (e.g., distilled LRMs) to reduce compute.
    2. Extending SCS to handle more than two modalities simultaneously.
    3. Integrating feedback loops where the OLLM can request clarification from the LRM on ambiguous visual inputs.

Authors

  • Yiran Guan
  • Sifan Tu
  • Dingkang Liang
  • Linghao Zhu
  • Jianzhong Ju
  • Zhenbo Luo
  • Jian Luan
  • Yuliang Liu
  • Xiang Bai

Paper Information

  • arXiv ID: 2602.23306v1
  • Categories: cs.CV
  • Published: February 26, 2026