[Paper] OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Published: May 6, 2026 at 01:50 PM EDT

Source: arXiv - 2605.05185v1

Overview

OpenSearch‑VL is a fully open‑source recipe for building state‑of‑the‑art multimodal search agents: models that can look up text and images, verify evidence, and reason over multiple steps. By releasing the data pipelines, tool environment, and training algorithms, the authors make it possible for anyone to reproduce and extend capabilities that were previously locked behind proprietary systems.

Key Contributions

Open training pipelines that generate high‑quality multimodal data from Wikipedia using path sampling, fuzzy entity rewriting, and visual grounding.
Two curated datasets:
- SearchVL‑SFT‑36k for supervised fine‑tuning (SFT).
- SearchVL‑RL‑8k for reinforcement learning (RL) of agentic behavior.
A unified multimodal tool suite (text search, image search, OCR, cropping, sharpening, super‑resolution, perspective correction) that lets agents interact with external resources in a plug‑and‑play fashion.
Fatal‑aware GRPO algorithm, a reinforcement‑learning method that gracefully handles tool failures by masking post‑failure tokens while still crediting useful pre‑failure reasoning.
Strong empirical results: an average absolute gain of more than 10 % across seven multimodal benchmarks, and performance on par with commercial black‑box models on several tasks.
Full open‑source release of data, code, and pretrained models to foster reproducible research.

Methodology

Data Construction – The authors start from Wikipedia articles and sample paths that link concepts (e.g., “Mars → Olympus Mons → volcanic activity”). They then apply fuzzy entity rewriting (e.g., swapping an entity name for a synonym) so that questions cannot be answered through trivial shortcuts, and they anchor visual evidence by linking text spans to corresponding images. This yields diverse, multi‑step queries that require both retrieval and reasoning.
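
The path sampling and entity rewriting steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the toy link graph `LINKS`, the `DESCRIPTIONS` table, and the functions `sample_path` and `fuzzy_rewrite` are all hypothetical, and the rewriting shown here uses indirect descriptions as one plausible form of fuzzy rewriting.

```python
import random

# Hypothetical toy link graph; the real pipeline works on Wikipedia articles.
LINKS = {
    "Mars": ["Olympus Mons", "Phobos"],
    "Olympus Mons": ["Volcanic activity", "Tharsis"],
    "Phobos": ["Asaph Hall"],
    "Volcanic activity": [],
    "Tharsis": [],
    "Asaph Hall": [],
}

# Hypothetical indirect descriptions used to hide entity names (assumption).
DESCRIPTIONS = {
    "Mars": "the fourth planet from the Sun",
    "Olympus Mons": "the largest known volcano in the Solar System",
}


def sample_path(graph, start, max_hops, rng):
    """Random walk over links without revisiting nodes; stops at dead ends."""
    path = [start]
    while len(path) - 1 < max_hops:
        candidates = [n for n in graph.get(path[-1], []) if n not in path]
        if not candidates:
            break
        path.append(rng.choice(candidates))
    return path


def fuzzy_rewrite(question, descriptions):
    """Replace each known entity name with its indirect description."""
    for entity, description in descriptions.items():
        question = question.replace(entity, description)
    return question


rng = random.Random(0)  # seeded so repeated runs sample the same path
path = sample_path(LINKS, "Mars", max_hops=2, rng=rng)
question = fuzzy_rewrite(
    f"Which topic is linked to {path[1]} on {path[0]}?", DESCRIPTIONS
)
print(path)
print(question)
```

Because the rewritten question no longer contains the literal entity names, a single keyword lookup is less likely to answer it, which is the kind of shortcut the rewriting step is meant to remove.
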
Tool Environment – A sandbox provides a common API for a suite of perception and search tools. An agent can issue commands such as search_text("quantum tunneling") or ocr(image_id), receive results, and feed them back into its reasoning loop.
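
A common tool API of this kind might look like the following minimal sketch. The names `ToolSandbox`, `ToolResult`, `register`, and `call` are hypothetical and are not taken from the released code; the two tools are stubs that return placeholder strings rather than real search or OCR results.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolResult:
    ok: bool      # False when the tool is unknown or raised an error
    output: str   # tool output on success, error message on failure


class ToolSandbox:
    """Registers tools by name and routes every call through one entry point."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> ToolResult:
        if name not in self._tools:
            return ToolResult(ok=False, output=f"unknown tool: {name}")
        try:
            return ToolResult(ok=True, output=self._tools[name](**kwargs))
        except Exception as exc:  # report the failure instead of crashing the loop
            return ToolResult(ok=False, output=f"{name} failed: {exc}")


# Stub tools standing in for real search and OCR back-ends (assumption).
def search_text(query: str) -> str:
    return f"[stub] top result for '{query}'"


def ocr(image_id: str) -> str:
    if not image_id:
        raise ValueError("empty image_id")
    return f"[stub] text extracted from {image_id}"


sandbox = ToolSandbox()
sandbox.register("search_text", search_text)
sandbox.register("ocr", ocr)

result = sandbox.call("search_text", query="quantum tunneling")
failed = sandbox.call("ocr", image_id="")
print(result)
print(failed)
```

Returning a structured result with an `ok` flag, rather than raising, lets the agent loop see a failed call as an ordinary observation and decide how to continue.
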
Training Regime – Training proceeds in two stages:
- Supervised Fine‑Tuning (SFT) on the 36k examples teaches the model the basic pattern of “question → tool calls → answer”.
- Reinforcement Learning (RL) with the fatal‑aware GRPO objective refines the policy to maximize long‑term reward (correct answer) while penalizing sequences that cause tool crashes. The algorithm masks tokens after a failure, preventing the model from learning from corrupted outputs, yet still credits the reasoning that led up to the failure via a one‑sided advantage clamp.
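
To make the masking and clamping idea concrete, here is a rough sketch of how per-token loss weights could be computed for one group of rollouts. It is an interpretation, not the paper's exact objective: the group-standardized advantage follows the usual GRPO pattern, while the clamp `max(advantage, 0)` on pre-failure tokens, the function names, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_weights(num_tokens, advantage, fatal_index=None):
    """Per-token weights (mask times advantage) for a single rollout.

    fatal_index is the position of the first token generated after a fatal
    tool failure, or None if the rollout finished normally.
    """
    if fatal_index is None:
        return [advantage] * num_tokens
    # Assumed one-sided clamp: pre-failure tokens can receive positive credit
    # but are never pushed down by a negative advantage.
    clamped = max(advantage, 0.0)
    # Tokens at or after the failure are masked out with weight 0.
    return [clamped if t < fatal_index else 0.0 for t in range(num_tokens)]


# Hypothetical rewards for a group of four rollouts of five tokens each.
rewards = [1.0, 0.4, 0.0, 0.0]
advantages = group_advantages(rewards)

clean = token_weights(5, advantages[0])                    # no failure
partial = token_weights(5, advantages[1], fatal_index=3)   # failed, above-average reward
failed = token_weights(5, advantages[2], fatal_index=3)    # failed, below-average reward
print(clean)
print(partial)
print(failed)
```

In this sketch, a rollout that fails after useful reasoning keeps a positive weight on its first three tokens, while a rollout that fails with a below-average reward contributes nothing rather than penalizing the tokens that preceded the crash.
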
Evaluation – The trained agents are benchmarked on seven multimodal search tasks (e.g., visual question answering with external knowledge, image‑grounded fact verification, OCR‑driven reasoning).

Results & Findings

Performance boost: Averaged over the seven benchmarks, OpenSearch‑VL outperforms prior open baselines by 10.3 % absolute in accuracy or F1 score.
Parity with closed‑source systems: On three of the benchmarks (e.g., Web‑Image QA, Multi‑Modal Fact Checking), the open model matches or exceeds the results reported for commercial APIs such as GPT‑4V or Claude‑Vision.
Robustness to tool failures: The fatal‑aware GRPO training reduces catastrophic error propagation; agents recover more gracefully after a failed OCR or search call, leading to a ≈15 % reduction in overall failure rate.
Ablation insights: Removing fuzzy entity rewriting drops performance by ~4 %, while omitting visual grounding harms image‑heavy tasks by up to 7 %. The tool suite’s diversity (especially super‑resolution) contributes noticeably to tasks that require high‑resolution visual details.

Practical Implications

Rapid prototyping of multimodal assistants – Developers can plug the released tool suite into their own LLM back‑ends (e.g., Llama‑3, Claude) and fine‑tune with the provided datasets, gaining search‑augmented capabilities without building data pipelines from scratch.
Enterprise knowledge retrieval – Companies with internal document and image repositories can adapt the Wikipedia‑based pipeline to their own corpora, enabling agents that fetch, verify, and synthesize information across text and visual assets.
Enhanced UI/UX for AI‑powered products – The ability to call OCR, cropping, or super‑resolution on‑the‑fly lets products automatically clean up scanned documents, extract tables, or improve low‑res screenshots before answering user queries.
Cost‑effective alternative to proprietary APIs – OpenSearch‑VL’s comparable performance means startups can avoid expensive per‑call fees while still offering high‑quality multimodal search features.
Research acceleration – With the full recipe public, the community can experiment with new tools (e.g., video retrieval) or alternative RL objectives, fostering a faster iteration loop in multimodal agent research.

Limitations & Future Work

Scale of training data – The curated datasets (36k SFT, 8k RL) are modest compared to the billions of examples used by commercial models; scaling up may yield further gains.
Domain specificity – The pipeline is tuned for Wikipedia‑style knowledge; applying it to highly specialized domains (medical imaging, legal documents) may require additional curation steps.
Tool reliability – Although the fatal‑aware GRPO mitigates failures, the underlying tools (search APIs, OCR engines) still introduce latency and occasional inaccuracies that can affect real‑time applications.
Evaluation breadth – Benchmarks focus on static image and text retrieval; extending evaluation to video, 3‑D data, or interactive environments remains an open avenue.
Future directions proposed by the authors include expanding the toolset (e.g., multimodal translation, speech‑to‑text), integrating larger LLM backbones, and exploring curriculum‑based RL to further improve multi‑step reasoning robustness.

Authors

Shuang Chen
Kaituo Feng
Hangting Chen
Wenxuan Huang
Dasen Dai
Quanxin Shou
Yunlong Lin
Xiangyu Yue
Shenghua Gao
Tianyu Pang

Paper Information

arXiv ID: 2605.05185v1
Categories: cs.CV
Published: May 6, 2026
