[Paper] Hunt Globally: Deep Research AI Agents for Drug Asset Scouting in Investing, Business Development, and Search & Evaluation
Source: arXiv - 2602.15019v1
Overview
The paper introduces Bioptic, a tree‑structured, self‑learning AI agent designed to hunt for drug‑development assets hidden in the massive, often non‑English scientific and patent literature. By benchmarking against leading LLM‑based research tools, the authors show that Bioptic dramatically improves recall while avoiding hallucinations, an essential capability for investors, business‑development teams, and venture capitalists who need to spot “under‑the‑radar” biotech opportunities before competitors do.
Key Contributions
- A new benchmarking framework for drug‑asset scouting that simulates real‑world investor queries, mixes languages, and uses LLM‑as‑judge scoring calibrated to expert opinions.
- Bioptic Agent architecture: a tree‑based, self‑learning “bioptic” (dual‑view) system that combines coarse‑grained retrieval with fine‑grained verification to achieve high recall without hallucination.
- Comprehensive empirical evaluation against five state‑of‑the‑art research agents (Claude Opus 4.6, GPT‑5.2 Pro, Gemini 3 Pro + Deep Research, Perplexity Deep Research, Exa Websets).
- Evidence that scaling compute (more retrieval passes, larger model inference) yields steep performance gains for this task.
- Open‑source‑ready pipeline for generating benchmark queries from real investor screening prompts, enabling reproducibility and future extensions.
Methodology
- Query Collection – The team gathered real screening prompts from biotech investors, business‑development (BD) professionals, and venture capitalists. These prompts serve as “priors” that reflect the complex, multi‑criteria nature of asset scouting.
- Synthetic Benchmark Generation – Using the priors, a conditional language model creates a large set of realistic, multilingual search queries. Each query is paired with a ground‑truth list of drug assets that are outside the typical U.S.-centric radar (e.g., Chinese patents, non‑English conference papers).
- Bioptic Agent Design –
- Coarse Retrieval Layer: a tree‑structured search over heterogeneous data sources (patent databases, regional journals, preprint servers) using multilingual embeddings.
- Fine Verification Layer: a second‑stage LLM that validates each candidate against source documents, filtering out hallucinated or irrelevant results.
- Self‑Learning Loop: feedback from the verification layer updates retrieval weights, allowing the system to improve over time without human re‑labeling.
- Evaluation – An LLM‑as‑judge model, calibrated with expert annotations, scores precision, recall, and F1 for each system on the benchmark.
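The coarse‑retrieval / fine‑verification pattern described above can be sketched in miniature. Everything here is an illustrative stand‑in, not the paper's implementation: the toy corpus, the lexical‑overlap "retriever" (standing in for multilingual embeddings), and the substring "verifier" (standing in for the second‑stage LLM check) are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    asset: str
    source_text: str
    score: float

# Toy corpus standing in for heterogeneous patent/literature sources.
CORPUS = [
    ("BTK-001", "Phase I BTK inhibitor disclosed in a Chinese patent filing"),
    ("KRAS-7",  "Preprint describing a KRAS G12C degrader in a regional journal"),
    ("NOISE-1", "Unrelated agronomy paper about crop yields"),
]

def coarse_retrieve(query: str, k: int = 3) -> list[Candidate]:
    """Coarse layer: cheap lexical overlap stands in for multilingual embeddings."""
    q_tokens = set(query.lower().split())
    scored = []
    for asset, text in CORPUS:
        overlap = len(q_tokens & set(text.lower().split()))
        scored.append(Candidate(asset, text, overlap / max(len(q_tokens), 1)))
    return sorted(scored, key=lambda c: c.score, reverse=True)[:k]

def fine_verify(query: str, cand: Candidate) -> bool:
    """Fine layer: stand-in for an LLM grounding check against the source
    document -- here, at least one query token must appear verbatim."""
    return any(tok in cand.source_text.lower() for tok in query.lower().split())

def scout(query: str) -> list[str]:
    """Pipeline: retrieve broadly, then keep only verified (grounded) hits."""
    return [c.asset for c in coarse_retrieve(query) if fine_verify(query, c)]

print(scout("btk inhibitor patent"))  # grounded hit kept, off-topic noise dropped
```

The design point being illustrated: the first layer is allowed to over‑retrieve for recall, because the second layer filters out anything that cannot be traced back to a source document, which is how the architecture keeps precision high.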
Results & Findings
| System | F1 Score |
|---|---|
| Bioptic Agent | 79.7 % |
| Claude Opus 4.6 | 56.2 % |
| Gemini 3 Pro + Deep Research | 50.6 % |
| GPT‑5.2 Pro | 46.6 % |
| Perplexity Deep Research | 44.2 % |
| Exa Websets | 26.9 % |
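For reference, the F1 figures above are the standard harmonic mean of precision and recall, computed per query over the returned assets versus the ground‑truth list. This is the textbook metric, not the paper's exact judging code; the asset names are hypothetical.

```python
def prf1(returned: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 over returned vs. ground-truth assets."""
    tp = len(returned & truth)  # true positives: correctly surfaced assets
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A hallucinated asset ("GHOST-9") hurts precision; a missed one hurts recall.
p, r, f1 = prf1({"BTK-001", "GHOST-9"}, {"BTK-001", "KRAS-7"})
print(p, r, f1)  # 0.5 0.5 0.5
```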
- Recall boost: Bioptic achieves an F1 of 79.7 %, a 23.5‑point absolute gain over the strongest baseline (Claude Opus 4.6 at 56.2 %), recovering far more of the hidden assets than any competitor.
- Hallucination control: The verification layer reduces false positives dramatically, keeping precision high even as recall rises.
- Compute scaling: Adding more retrieval passes (i.e., deeper tree exploration) yields a near‑linear improvement in F1, confirming that compute budget can be traded for coverage.
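One way to build intuition for the compute‑coverage trade‑off is a toy model (my own assumption, not the paper's analysis): if each independent retrieval pass surfaces a given hidden asset with probability p, then k passes give expected recall 1 − (1 − p)^k, which grows roughly linearly for small k before saturating.

```python
# Toy model (an assumption, not from the paper): independent retrieval passes
# compound coverage, so expected recall rises quickly and then saturates.
def expected_recall(p: float, passes: int) -> float:
    return 1 - (1 - p) ** passes

for k in (1, 2, 4, 8):
    print(f"passes={k} expected_recall={expected_recall(0.3, k):.3f}")
```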
Practical Implications
- Accelerated deal sourcing – Investment teams can automate the first‑pass scan of global patent and literature feeds, surfacing promising candidates weeks earlier than manual scouting.
- Risk mitigation – By reliably surfacing non‑English assets, firms reduce the chance of missing a breakthrough that could affect valuation or competitive positioning.
- Integration‑friendly – The tree‑based architecture can be wrapped as a micro‑service, plugging into existing CRMs, deal‑flow platforms, or internal knowledge graphs.
- Cost‑effective scaling – Since performance scales with compute, organizations can start with modest resources (e.g., a few GPUs) and ramp up as the pipeline proves ROI.
- Cross‑domain reuse – The bioptic pattern (coarse retrieval + fine verification) is applicable to other high‑recall, low‑hallucination domains such as regulatory compliance, threat intelligence, or scientific literature reviews.
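The micro‑service integration mentioned above could be as thin as a JSON‑in, JSON‑out handler around the agent. This is a hypothetical sketch: the request shape, field names, and stub agent are illustrative, not an API from the paper; a real deployment would sit behind an HTTP framework.

```python
import json

# Hypothetical handler: parse a JSON scouting request, delegate to the agent
# (a stub here), and serialize the surfaced assets back as JSON.
def handle_scout_request(body: bytes, agent=lambda q: ["EXAMPLE-ASSET"]) -> bytes:
    req = json.loads(body)
    assets = agent(req["query"])
    return json.dumps({"query": req["query"], "assets": assets}).encode()

resp = handle_scout_request(b'{"query": "btk inhibitor"}')
print(json.loads(resp))
```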
Limitations & Future Work
- Data freshness – The benchmark relies on static snapshots of patent and publication databases; real‑time updates (e.g., newly filed Chinese patents) could affect performance.
- Language coverage – While multilingual, the system currently favors languages with abundant pretrained embeddings; low‑resource languages may still be under‑represented.
- Compute cost – The steep gains with added compute come with higher operational expense, which may be prohibitive for smaller firms.
- Human‑in‑the‑loop validation – The study uses an LLM judge calibrated to experts, but a full user study with domain specialists would better quantify practical usability.
- Extending to therapeutic efficacy – Future work could integrate downstream data (e.g., clinical trial outcomes) to not just locate assets but also rank them by translational potential.
Authors
- Alisa Vinogradova
- Vlad Vinogradov
- Luba Greenwood
- Ilya Yasny
- Dmitry Kobyzev
- Shoman Kasbekar
- Kong Nguyen
- Dmitrii Radkevich
- Roman Doronin
- Andrey Doronichev
Paper Information
- arXiv ID: 2602.15019v1
- Categories: cs.AI, cs.IR
- Published: February 16, 2026