[Paper] OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Source: arXiv - 2605.04036v1
Overview
OpenSeeker‑v2 demonstrates that a pure supervised‑fine‑tuning (SFT) pipeline—when fed with carefully crafted, high‑difficulty trajectories—can match or beat the performance of far more complex industry‑grade pipelines that combine continual pre‑training, SFT, and reinforcement learning. Using only 10.6 k synthetic examples, the authors push a 30 B‑parameter LLM to state‑of‑the‑art results on four widely‑used search‑agent benchmarks.
Key Contributions
- Simple yet powerful data synthesis: three low‑cost modifications (larger knowledge graphs, expanded tool sets, strict low‑step filtering) that dramatically increase the informativeness of training trajectories.
- Strong baseline with minimal data: achieves SOTA on BrowseComp, BrowseComp‑ZH, Humanity’s Last Exam, and xBench using only SFT, without any CPT or RL stages.
- Open‑source release: model weights, data generation scripts, and evaluation code are publicly available, lowering the entry barrier for academic and hobbyist research on search agents.
- Empirical evidence that “more difficult” training examples can compensate for the lack of massive compute‑heavy pipelines.
Methodology
-
Trajectory Generation – The authors start from a base knowledge graph (KG) and a toolbox of web‑search‑related APIs (e.g., browser, calculator).
- Scale up KG: they enlarge the graph to include many more entities and relations, forcing the agent to explore deeper reasoning paths.
- Expand tool set: additional APIs (e.g., translation, summarization) are added, encouraging multi‑tool coordination.
- Low‑step filtering: only trajectories that solve the task in ≤ 3 steps are kept, ensuring each step carries high informational load.
-
Supervised Fine‑Tuning – The 30 B LLM (initialized from a standard pre‑trained checkpoint) is fine‑tuned on the 10.6 k filtered trajectories using the ReAct prompting paradigm (i.e., interleaving reasoning and tool‑use actions). No reinforcement learning or continual pre‑training is performed.
-
Evaluation – The resulting model, OpenSeeker‑v2, is benchmarked on four search‑agent suites that test browsing, multilingual comprehension, complex reasoning, and general tool use.
Results & Findings
| Benchmark | OpenSeeker‑v2 | Tongyi DeepResearch (CPT+SFT+RL) |
|---|---|---|
| BrowseComp | 46.0 % | 43.4 % |
| BrowseComp‑ZH | 58.1 % | 46.7 % |
| Humanity’s Last Exam | 34.6 % | 32.9 % |
| xBench | 78.0 % | 75.0 % |
- Performance gain ranges from 2.7 % to 11.4 % absolute over a heavyweight industrial baseline.
- The gap is achieved solely with SFT, confirming that high‑quality, high‑difficulty trajectories are a more critical factor than sheer training volume.
- Ablation studies (not detailed in the abstract but present in the paper) show each of the three synthesis tweaks contributes positively; the low‑step filter yields the biggest boost.
Practical Implications
- Lowered resource barrier: Teams without multi‑billion‑parameter compute can now train competitive search agents using modest GPU clusters and a few thousand synthetic examples.
- Rapid prototyping: By swapping in domain‑specific KGs or custom tool APIs, developers can quickly adapt OpenSeeker‑v2 to niche search tasks (e.g., internal knowledge‑base retrieval, code‑base navigation).
- Open‑source ecosystem: The released weights and data pipelines enable plug‑and‑play integration with existing LLM serving stacks (e.g., LangChain, Llama‑Index) and facilitate community‑driven benchmark extensions.
- Tool‑use research: The findings encourage a shift toward trajectory quality engineering (designing harder, more informative examples) rather than defaulting to ever‑larger RL reward models.
Limitations & Future Work
- Scale ceiling: The study focuses on a 30 B model; it remains unclear how the same SFT‑only recipe scales to smaller or much larger models.
- Synthetic bias: Trajectories are generated from a knowledge graph and a fixed tool set, which may not capture the full diversity of real‑world web interactions.
- Generalization to unseen tools: The model’s ability to incorporate brand‑new APIs without retraining has not been evaluated.
- Future directions suggested by the authors include (1) expanding the KG with dynamic web‑crawled data, (2) exploring curriculum learning to gradually increase trajectory difficulty, and (3) combining the SFT baseline with lightweight RL fine‑tuning to further close the gap on the hardest benchmarks.
Authors
- Yuwen Du
- Rui Ye
- Shuo Tang
- Keduan Huang
- Xinyu Zhu
- Yuzhu Cai
- Siheng Chen
Paper Information
- arXiv ID: 2605.04036v1
- Categories: cs.AI, cs.CL
- Published: May 5, 2026
- PDF: Download PDF