[Paper] UCRBench: Benchmarking LLMs on Use Case Recovery
Source: arXiv - 2512.13360v1
Overview
The paper introduces UCRBench, the first large‑scale, code‑aligned benchmark for evaluating how well large language models (LLMs) can reverse‑engineer use cases—the textual specifications that describe what a software system should do—from real‑world source code. By grounding the benchmark in manually validated use cases from nine diverse projects, the authors provide a realistic yardstick for measuring LLMs’ ability to recover functional requirements, a task that directly impacts documentation, onboarding, and automated analysis pipelines.
Key Contributions
- UCRBench dataset: 1,200+ manually verified user‑goal and sub‑function use cases covering nine open‑source projects of varying size, domain, and architecture.
- Hierarchical evaluation protocol: A four‑tier metric suite (actor correctness, name accuracy, path fidelity, behavioral coverage) that quantifies both surface‑level and deep functional recovery.
- Comprehensive empirical study: Benchmarking of several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude, LLaMA‑2) on the use‑case recovery task, revealing systematic strengths and failure modes.
- Error taxonomy: Identification of common pitfalls such as omission of sub‑functions, inconsistent abstraction levels, and domain‑specific terminology gaps.
- Open‑source release: All benchmark data, evaluation scripts, and prompts are publicly available, enabling reproducibility and future extensions.
Methodology
- Project selection – Nine mature, publicly available software systems were chosen to span web services, CLI tools, libraries, and domain‑specific applications.
- Use‑case extraction & validation – Developers familiar with each codebase manually wrote user‑goal use cases (high‑level user stories) and decomposed them into sub‑function use cases (fine‑grained functional steps). Each entry was double‑checked for correctness against the actual code.
- Prompt design – For each target LLM, the authors crafted a concise “reverse‑engineer use case” prompt that supplies the relevant source files (or a summary) and asks the model to output a use case in a prescribed template (a prompt‑construction sketch follows this list).
- Hierarchical evaluation – Each generated use case is scored on four tiers (see the coverage sketch after this list):
  - Actor correctness: Does the generated use case mention the right primary actor(s)?
  - Name accuracy: Are the action and object names semantically aligned with the code?
  - Path fidelity: Does the sequence of steps follow the actual control flow of the implementation?
  - Behavioral coverage: What proportion of the ground‑truth functional elements is captured (precision/recall)?
- Statistical analysis – Results are aggregated per project and per model, and significance testing is applied to compare performance across dimensions (e.g., single‑module vs. multi‑module systems).
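The exact prompt text is not reproduced in this summary, so the snippet below is only a minimal sketch of the prompt‑design step, assuming a plain concatenate‑and‑template approach in Python; the template fields, the character budget, and the `build_recovery_prompt` name are illustrative stand‑ins, not the authors' actual artifacts.

```python
from pathlib import Path

# Illustrative template; the real UCRBench prompt and template ship with the benchmark release.
USE_CASE_TEMPLATE = """\
Use Case Name:
Primary Actor:
Preconditions:
Main Success Scenario (numbered steps):
Extensions:
"""

def build_recovery_prompt(source_paths: list[str], max_chars: int = 20_000) -> str:
    """Concatenate the relevant source files (crudely truncated to a character
    budget) and ask the model to fill in a prescribed use-case template."""
    snippets, budget = [], max_chars
    for path in source_paths:
        code = Path(path).read_text(encoding="utf-8", errors="ignore")[:budget]
        budget -= len(code)
        snippets.append(f"### File: {path}\n{code}")
        if budget <= 0:
            break  # the paper also mentions supplying summaries instead of raw files
    return (
        "You are given source code from a software system.\n"
        "Reverse-engineer the use case it implements, filling in this template:\n\n"
        f"{USE_CASE_TEMPLATE}\n"
        + "\n\n".join(snippets)
    )

# Usage (paths are hypothetical):
# prompt = build_recovery_prompt(["src/auth/login.py", "src/auth/session.py"])
```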
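For the behavioral‑coverage tier, the protocol amounts to element‑level precision/recall; the sketch below assumes both the generated and ground‑truth use cases have already been decomposed into lists of functional elements and matches them by normalized string equality, a deliberate simplification of whatever (likely more semantic) matching the paper actually applies.

```python
from dataclasses import dataclass

def _normalize(element: str) -> str:
    # Crude normalization; an embedding- or LLM-based matcher would be more realistic.
    return " ".join(element.lower().split())

@dataclass
class Coverage:
    precision: float
    recall: float
    f1: float

def behavioral_coverage(generated: list[str], ground_truth: list[str]) -> Coverage:
    """Share of ground-truth functional elements captured by the model output."""
    gen = {_normalize(e) for e in generated}
    gold = {_normalize(e) for e in ground_truth}
    matched = gen & gold
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Coverage(precision, recall, f1)

# Hypothetical example: the model misses one of three ground-truth elements.
print(behavioral_coverage(
    generated=["User submits login form", "System validates credentials"],
    ground_truth=["user submits login form", "system validates credentials",
                  "system issues access token"],
))  # -> precision 1.0, recall ~0.67, f1 ~0.8
```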
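For the statistical‑analysis step, this summary does not name the specific test, so the fragment below just illustrates one common choice, a Wilcoxon signed‑rank test over paired per‑project scores of two models; the score arrays are placeholders, not numbers from the paper.

```python
# Paired, non-parametric comparison of two models across the nine projects.
from scipy.stats import wilcoxon

# Per-project behavioral-coverage scores (placeholders, NOT the paper's results).
model_a = [0.61, 0.55, 0.48, 0.72, 0.50, 0.58, 0.44, 0.63, 0.47]
model_b = [0.52, 0.49, 0.41, 0.66, 0.43, 0.51, 0.39, 0.57, 0.40]

stat, p_value = wilcoxon(model_a, model_b)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
```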
Results & Findings
| Model | Avg. Actor Acc. | Avg. Name Acc. | Avg. Path Fidelity | Avg. Behavioral Coverage |
|---|---|---|---|---|
| GPT‑4 | 78 % | 71 % | 62 % | 55 % |
| Claude‑2 | 73 % | 66 % | 58 % | 48 % |
| LLaMA‑2‑13B | 61 % | 54 % | 44 % | 37 % |
- Partial success: All models often identify the correct actor and produce plausible action verbs, but they frequently miss or misname domain‑specific objects (e.g., “OAuth token” vs. “access token”).
- Project variance: Performance on small, single‑module utilities (e.g., a CLI parser) is >20 % higher than on large, multi‑module web services with complex business logic.
- High omission rate: On average, 38 % of the sub‑functions present in the ground truth are omitted, indicating that LLMs tend to generate concise but incomplete specifications.
- Abstraction drift: When asked to aggregate sub‑functions into a user‑goal use case, models often either over‑generalize (dropping essential steps) or under‑generalize (listing too many low‑level details).
- Domain‑specific vocabulary: Models trained primarily on general‑purpose code struggle with niche APIs (e.g., scientific computing libraries), leading to lower name accuracy.
Practical Implications
- Automated documentation pipelines – UCRBench shows that LLMs can be used as a first draft generator for use‑case documentation, but human review remains essential, especially for safety‑critical or domain‑specific systems.
- Onboarding & knowledge transfer – Teams can leverage LLM‑generated use cases to quickly surface high‑level functionality of legacy codebases, accelerating new‑developer ramp‑up.
- Requirement traceability tools – By integrating LLMs with issue‑tracking systems, developers could auto‑populate traceability matrices that link code commits to recovered use cases (a minimal sketch follows this list).
- Test‑case generation – Accurate sub‑function recovery can feed downstream test‑case synthesis tools, reducing manual effort in building functional test suites.
- Model fine‑tuning – The identified gaps (domain terminology, multi‑module reasoning) suggest concrete data‑augmentation strategies: feeding more domain‑specific corpora and multi‑file context windows during fine‑tuning.
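As a concrete illustration of the traceability idea above, here is a minimal in‑memory sketch that maps source files to the recovered use cases referencing them; the record fields (`id`, `files`) and the example paths are hypothetical and not tied to UCRBench or any particular issue tracker.

```python
from collections import defaultdict

def build_traceability_matrix(recovered_use_cases: list[dict]) -> dict[str, list[str]]:
    """Map each source file to the recovered use cases that were generated from it."""
    matrix: dict[str, list[str]] = defaultdict(list)
    for uc in recovered_use_cases:
        for path in uc["files"]:
            matrix[path].append(uc["id"])
    return dict(matrix)

# Hypothetical recovered use cases (ids and paths are illustrative).
use_cases = [
    {"id": "UC-01 Login", "files": ["src/auth/login.py", "src/auth/session.py"]},
    {"id": "UC-02 Reset password", "files": ["src/auth/reset.py", "src/auth/session.py"]},
]
for path, linked in build_traceability_matrix(use_cases).items():
    print(f"{path}: {', '.join(linked)}")
```

Linking these entries to commits or issue IDs would follow the same pattern, with the commit hash or ticket key replacing the file path as the key.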
Limitations & Future Work
- Scope of projects – Although nine projects provide diversity, they still represent a limited slice of the software ecosystem; industrial proprietary codebases may exhibit different characteristics.
- Prompt sensitivity – The study uses a single prompt template per model; variations in prompt engineering could materially affect results, an area the authors plan to explore.
- Static analysis only – The benchmark relies on source code snapshots; dynamic behavior (runtime configuration, external services) is not captured, potentially under‑estimating the difficulty of full functional recovery.
- Future directions – Extending UCRBench to include multi‑language projects, evaluating retrieval‑augmented generation (RAG) pipelines, and investigating interactive “clarify‑and‑refine” loops where the model can ask follow‑up questions to improve use‑case fidelity.
Authors
- Shuyuan Xiao
- Yiran Zhang
- Weisong Sun
- Xiaohong Chen
- Yang Liu
- Zhi Jin
Paper Information
- arXiv ID: 2512.13360v1
- Categories: cs.SE
- Published: December 15, 2025