[Paper] OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Published: May 5, 2026 at 01:55 PM EDT

Source: arXiv - 2605.04036v1

Overview

OpenSeeker‑v2 demonstrates that a pure supervised‑fine‑tuning (SFT) pipeline, when fed carefully crafted, high‑difficulty trajectories, can match or beat far more complex industry‑grade pipelines that combine continual pre‑training (CPT), SFT, and reinforcement learning (RL). Using only 10.6 k synthetic examples, the authors push a 30 B‑parameter LLM to state‑of‑the‑art results on four widely used search‑agent benchmarks.

Key Contributions

Simple yet powerful data synthesis: three low‑cost modifications (larger knowledge graphs, expanded tool sets, strict low‑step filtering) that dramatically increase the informativeness of training trajectories.
Strong baseline with minimal data: achieves SOTA on BrowseComp, BrowseComp‑ZH, Humanity’s Last Exam, and xBench using only SFT, without any CPT or RL stages.
Open‑source release: model weights, data generation scripts, and evaluation code are publicly available, lowering the entry barrier for academic and hobbyist research on search agents.
Empirical evidence that “more difficult” training examples can compensate for the lack of massive compute‑heavy pipelines.

Methodology

Trajectory Generation – The authors start from a base knowledge graph (KG) and a toolbox of web‑search‑related APIs (e.g., browser, calculator).
- Scale up KG: they enlarge the graph to include many more entities and relations, forcing the agent to explore deeper reasoning paths.
- Expand tool set: additional APIs (e.g., translation, summarization) are added, encouraging multi‑tool coordination.
- Low‑step filtering: only trajectories that solve the task in ≤ 3 steps are kept, ensuring each step carries high informational load.
Supervised Fine‑Tuning – The 30 B LLM (initialized from a standard pre‑trained checkpoint) is fine‑tuned on the 10.6 k filtered trajectories using the ReAct prompting paradigm (i.e., interleaving reasoning and tool‑use actions). No reinforcement learning or continual pre‑training is performed.
Evaluation – The resulting model, OpenSeeker‑v2, is benchmarked on four search‑agent suites that test browsing, multilingual comprehension, complex reasoning, and general tool use.
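
The filtering and formatting steps above can be sketched in a few lines. This is a minimal illustration, not the authors' released pipeline: the `Step` and `Trajectory` classes, the `filter_by_steps` and `to_react_text` helpers, the tool name `search`, and the exact Thought/Action/Observation text layout are all hypothetical. The step threshold follows the rule as stated in this summary (at most three steps) and is exposed as a parameter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    thought: str       # model reasoning before acting
    tool: str          # hypothetical tool name, e.g. "search"
    tool_input: str    # argument passed to the tool
    observation: str   # tool output fed back to the model


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def filter_by_steps(trajectories, max_steps=3):
    """Keep trajectories that finish in at most `max_steps` tool calls."""
    return [t for t in trajectories if len(t.steps) <= max_steps]


def to_react_text(traj):
    """Serialize one trajectory as interleaved Thought/Action/Observation text."""
    lines = [f"Question: {traj.question}"]
    for s in traj.steps:
        lines.append(f"Thought: {s.thought}")
        lines.append(f"Action: {s.tool}[{s.tool_input}]")
        lines.append(f"Observation: {s.observation}")
    lines.append(f"Final Answer: {traj.answer}")
    return "\n".join(lines)


# Toy data: one 1-step trajectory and one 5-step trajectory.
short = Trajectory("Q1", [Step("look it up", "search", "q1", "r1")], "A1")
long = Trajectory("Q2", [Step("t", "search", str(i), "r") for i in range(5)], "A2")

kept = filter_by_steps([short, long])          # only the 1-step trajectory survives
sft_examples = [to_react_text(t) for t in kept]
```

Each string in `sft_examples` would then serve as one SFT training example; a real pipeline would also need a chat template and loss masking over the observation tokens, which this sketch leaves out.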

Results & Findings

Benchmark	OpenSeeker‑v2	Tongyi DeepResearch (CPT+SFT+RL)
BrowseComp	46.0 %	43.4 %
BrowseComp‑ZH	58.1 %	46.7 %
Humanity’s Last Exam	34.6 %	32.9 %
xBench	78.0 %	75.0 %

Performance gains range from 1.7 to 11.4 percentage points (absolute) over a heavyweight industrial baseline.
The gap is achieved with SFT alone, which suggests that high‑quality, high‑difficulty trajectories can matter more than pipeline complexity or sheer training volume.
Ablation studies (not detailed in the abstract but present in the paper) show each of the three synthesis tweaks contributes positively; the low‑step filter yields the biggest boost.
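
The per‑benchmark differences can be recomputed directly from the table. The snippet below is a small arithmetic check using only the scores listed above; the variable names are illustrative.

```python
# (OpenSeeker-v2 score, Tongyi DeepResearch score), both in percent
scores = {
    "BrowseComp": (46.0, 43.4),
    "BrowseComp-ZH": (58.1, 46.7),
    "Humanity's Last Exam": (34.6, 32.9),
    "xBench": (78.0, 75.0),
}

# Absolute gain in percentage points, rounded to one decimal place
gains = {name: round(ours - base, 1) for name, (ours, base) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")

smallest = min(gains.values())  # 1.7 (Humanity's Last Exam)
largest = max(gains.values())   # 11.4 (BrowseComp-ZH)
```

The smallest margin is on Humanity's Last Exam and the largest on BrowseComp‑ZH.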

Practical Implications

Lowered resource barrier: Teams without large‑scale compute budgets can now train competitive search agents using modest GPU clusters and roughly ten thousand synthetic examples.
Rapid prototyping: By swapping in domain‑specific KGs or custom tool APIs, developers can quickly adapt OpenSeeker‑v2 to niche search tasks (e.g., internal knowledge‑base retrieval, code‑base navigation).
Open‑source ecosystem: The released weights and data pipelines enable plug‑and‑play integration with existing LLM application frameworks (e.g., LangChain, LlamaIndex) and facilitate community‑driven benchmark extensions.
Tool‑use research: The findings encourage a shift toward trajectory quality engineering (designing harder, more informative examples) rather than defaulting to ever‑larger RL reward models.

Limitations & Future Work

Scale ceiling: The study focuses on a 30 B model; it remains unclear how the same SFT‑only recipe scales to smaller or much larger models.
Synthetic bias: Trajectories are generated from a knowledge graph and a fixed tool set, which may not capture the full diversity of real‑world web interactions.
Generalization to unseen tools: The model’s ability to incorporate brand‑new APIs without retraining has not been evaluated.
Future directions suggested by the authors include (1) expanding the KG with dynamic web‑crawled data, (2) exploring curriculum learning to gradually increase trajectory difficulty, and (3) combining the SFT baseline with lightweight RL fine‑tuning to further close the gap on the hardest benchmarks.

Authors

Yuwen Du
Rui Ye
Shuo Tang
Keduan Huang
Xinyu Zhu
Yuzhu Cai
Siheng Chen

Paper Information

arXiv ID: 2605.04036v1
Categories: cs.AI, cs.CL
Published: May 5, 2026
