[Paper] Q-ARE: An Evaluation Dataset for Query Based API Recommendation
Source: arXiv - 2605.00472v1
Overview
The paper “Q-ARE: An Evaluation Dataset for Query Based API Recommendation” tackles a pain point that many developers know all too well: wading through countless third‑party libraries to find the right API for a specific task. By releasing a carefully curated benchmark (Q‑ARE) and new evaluation metrics, the authors expose where current query‑based recommendation tools—and even large language models—still stumble, especially when the needed API lives several calls away from the developer’s entry point.
Key Contributions
- Q‑ARE dataset – A large, open‑source Java benchmark derived from real GitHub projects, linking query methods to the exact third‑party APIs they eventually invoke (directly or indirectly).
- Hierarchical call unification – A systematic process that collapses multi‑level invocation chains into a single “target API set,” preserving the true functional relationship.
- Two novel metrics
- API Call Depth – Counts how many method calls separate the query from the target API.
- Invocation Density – Measures the proportion of code lines in the call chain that belong to the target API.
- Comprehensive evaluation – Benchmarks several state‑of‑the‑art query‑based recommendation approaches and popular LLMs (e.g., GPT‑4, Claude) on Q‑ARE, revealing performance trends across depth and density.
- Insightful analysis – Shows that recommendation quality degrades sharply as depth increases and density drops, highlighting a blind spot for existing techniques.
Methodology
- Data collection – The authors mined thousands of Java repositories on GitHub, focusing on projects that import third‑party libraries.
- Method‑API mapping – For each method that a developer might query (the “source”), they performed static analysis to trace every call path until a third‑party API is reached.
- Recursive expansion – If a call leads to another internal method, the analysis continues recursively, building a full invocation tree.
- Target set unification – All leaf APIs reached through any path are merged into a single target set for that source method, eliminating duplicate or redundant entries (see the first sketch after this list).
- Metric computation (both metrics are computed in the first sketch below)
- API Call Depth = longest path length from source to any target API.
- Invocation Density = (lines of code belonging to target APIs) / (total lines in the call chain).
- Benchmarking – Existing query‑based recommendation tools (e.g., DeepAPI, Code2API) and several LLMs were fed the source method’s natural‑language description and asked to rank candidate APIs. Their rankings were scored against the ground‑truth target sets using standard precision/recall and the new depth/density lenses (the second sketch below illustrates this scoring).
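
To make the expansion, unification, and metric steps concrete, here is a minimal Java sketch, assuming static analysis has already produced an invocation tree. `MethodNode`, its fields, and the method names are hypothetical stand-ins for the paper’s tooling, not its actual implementation.

```java
import java.util.*;

/** Minimal sketch of call-chain expansion, unification, and the two metrics. */
final class CallChainSketch {

    /** A node in the statically derived invocation tree (hypothetical type). */
    static final class MethodNode {
        final String signature;               // e.g. "com.google.gson.Gson#toJson"
        final boolean thirdParty;             // true once a third-party API is reached
        final int lines;                      // lines of code attributed to this method
        final List<MethodNode> callees = new ArrayList<>();

        MethodNode(String signature, boolean thirdParty, int lines) {
            this.signature = signature;
            this.thirdParty = thirdParty;
            this.lines = lines;
        }
    }

    /** Target set unification: merge every leaf API reachable from the source. */
    static Set<String> unifyTargets(MethodNode source) {
        Set<String> targets = new LinkedHashSet<>();
        collect(source, targets, Collections.newSetFromMap(new IdentityHashMap<>()));
        return targets;
    }

    private static void collect(MethodNode node, Set<String> targets, Set<MethodNode> visited) {
        if (!visited.add(node)) return;            // guard against cyclic call graphs
        for (MethodNode callee : node.callees) {
            if (callee.thirdParty) {
                targets.add(callee.signature);     // leaf reached: record the API once
            } else {
                collect(callee, targets, visited); // internal method: expand recursively
            }
        }
    }

    /** API Call Depth: the longest path from the source to any target API. */
    static int apiCallDepth(MethodNode node) {
        int best = 0;                              // 0 means no third-party API reachable
        for (MethodNode callee : node.callees) {
            if (callee.thirdParty) {
                best = Math.max(best, 1);          // a direct call has depth 1
            } else {
                int sub = apiCallDepth(callee);    // assumes an acyclic tree for brevity
                if (sub > 0) best = Math.max(best, 1 + sub);
            }
        }
        return best;
    }

    /** Invocation Density: share of chain lines that belong to target APIs. */
    static double invocationDensity(MethodNode source) {
        int[] apiLines = {0}, totalLines = {0};
        tally(source, apiLines, totalLines, Collections.newSetFromMap(new IdentityHashMap<>()));
        return totalLines[0] == 0 ? 0.0 : (double) apiLines[0] / totalLines[0];
    }

    private static void tally(MethodNode node, int[] api, int[] total, Set<MethodNode> visited) {
        if (!visited.add(node)) return;
        total[0] += node.lines;
        if (node.thirdParty) api[0] += node.lines;
        for (MethodNode callee : node.callees) tally(callee, api, total, visited);
    }
}
```

In this sketch, `unifyTargets(source)` would yield the ground‑truth set for one query method, while `apiCallDepth` and `invocationDensity` label that entry with the paper’s two difficulty dimensions.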
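The scoring step can be sketched the same way: a tool’s (or LLM’s) ranked output is compared against the unified target set with precision@k and recall@k. The class, the example API names, and `main` below are invented for illustration; the paper’s exact evaluation harness may differ.

```java
import java.util.*;

/** Minimal sketch of scoring a ranked recommendation list (hypothetical names). */
final class RankScorer {

    /** Fraction of the top-k recommendations that are true target APIs. */
    static double precisionAtK(List<String> ranked, Set<String> targets, int k) {
        int n = Math.min(k, ranked.size());
        if (n == 0) return 0.0;
        long hits = ranked.stream().limit(n).filter(targets::contains).count();
        return (double) hits / n;
    }

    /** Fraction of the target set recovered within the top-k recommendations. */
    static double recallAtK(List<String> ranked, Set<String> targets, int k) {
        if (targets.isEmpty()) return 0.0;
        long hits = ranked.stream().limit(k).filter(targets::contains).count();
        return (double) hits / targets.size();
    }

    public static void main(String[] args) {
        Set<String> targets = Set.of("com.google.gson.Gson#toJson", "okhttp3.Call#execute");
        List<String> ranked = List.of(
                "com.google.gson.Gson#toJson",     // hit at rank 1 (Top-1 correct)
                "java.util.List#stream",           // miss
                "okhttp3.Call#execute");           // hit at rank 3
        System.out.printf("P@3 = %.2f, R@3 = %.2f%n",
                precisionAtK(ranked, targets, 3), recallAtK(ranked, targets, 3));
    }
}
```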
Results & Findings
| Scenario | Top‑1 Accuracy | Trend |
|---|---|---|
| Depth = 1 (direct calls) | ~78% (best LLM) | High performance; most tools can spot obvious APIs. |
| Depth = 2‑3 | ~45% | Accuracy halves once the needed API is a couple of calls away. |
| Depth ≥ 4 | < 20% | Severe degradation; models rarely infer deep call chains. |
| High Invocation Density (≥ 0.6) | ~60% | When the target API dominates the call chain, recommendations improve. |
| Low Invocation Density (< 0.3) | < 30% | Sparse API usage hurts all methods. |
- LLMs vs. specialized tools – Large language models outperform traditional query‑based systems on shallow queries but still suffer the same depth‑related drop‑off.
- Error analysis – Most failures stem from missing the semantic bridge—the intermediate utility methods that adapt generic data structures to the API’s specific types.
- Metric validation – Both API Call Depth and Invocation Density correlate strongly with observed performance (Pearson |r| ≈ 0.71), confirming that they capture meaningful difficulty dimensions.
Practical Implications
- Tool developers – Incorporate call‑graph awareness in IDE assistants or search engines. Simple keyword matching won’t cut it for “indirect” API needs; consider static analysis or graph neural networks that can reason over multi‑hop relationships.
- LLM prompt engineering – Prompt designers should explicitly ask the model to “trace the call chain” or provide intermediate method signatures, which can coax deeper reasoning (see the prompt sketch after this list).
- API providers – Document not only the public methods but also typical usage patterns (e.g., helper utilities) that often sit between user code and the core API—this can improve discoverability in recommendation pipelines.
- Developer onboarding – Teams can apply Q‑ARE’s construction pipeline to their own repositories to build fine‑tuning data, yielding custom models that understand the codebase’s idioms and give more accurate suggestions for proprietary libraries.
- Benchmarking standards – Q‑ARE offers a realistic, open benchmark for the community to compare new approaches, moving beyond synthetic or single‑level datasets that over‑estimate performance.
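
As an illustration of the prompt‑engineering point above, here is a hypothetical helper that builds a call‑chain‑aware prompt; the wording is our assumption, not a template from the paper.

```java
/** Minimal sketch of a call-chain-aware prompt (hypothetical wording). */
final class ChainPrompt {

    static String build(String queryDescription) {
        // The explicit "trace the call chain" instruction is the point of the
        // paper's advice; everything else here is illustrative filler.
        return """
                Task: recommend third-party Java APIs for the method described below.
                Before answering, trace the likely call chain step by step: list the
                intermediate helper methods an implementation would pass through,
                then name the third-party API each chain ultimately reaches.
                Finally, rank the top 5 APIs, most relevant first.

                Method description: %s
                """.formatted(queryDescription);
    }

    public static void main(String[] args) {
        System.out.println(build("Parse a JSON string into a map and POST it over HTTP."));
    }
}
```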
Limitations & Future Work
- Language scope – The dataset is limited to Java; extending the methodology to Python, JavaScript, or Rust would test cross‑language generality.
- Static analysis only – Dynamic behaviors (reflection, runtime code generation) are not captured, potentially omitting some real‑world API usages.
- Ground‑truth granularity – Treating all leaf APIs as equally relevant ignores cases where only a subset is truly needed for the developer’s intent.
- Scalability of metrics – Computing Invocation Density for massive codebases can be costly; approximate or sampling‑based methods could be explored.
- Model improvements – Future work could integrate call‑graph embeddings directly into LLM fine‑tuning, or develop hybrid systems that combine neural ranking with symbolic analysis to better handle deep invocation structures.
Authors
- Shenglong Wu
- Xunhui Zhang
- Tao Wang
Paper Information
- arXiv ID: 2605.00472v1
- Categories: cs.SE
- Published: May 1, 2026