[Paper] ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation
Source: arXiv - 2602.18306v1
Overview
The paper introduces ReqElicitGym, a sandbox‑style evaluation platform that lets researchers and engineers automatically test how well large language models (LLMs) can “interview” users to uncover software requirements. By pairing a rich set of simulated user interactions with an objective scoring system, the authors make it possible to benchmark conversational requirement‑elicitation agents in a reproducible, quantitative way, something current practice has lacked.
Key Contributions
- ReqElicitGym environment: an interactive, fully automated testbed that simulates a real user (the oracle user) and grades elicitation performance (the task evaluator).
- Large, diverse dataset: 101 end‑to‑end website‑building scenarios covering 10 different application domains (e.g., e‑commerce, blogs, dashboards).
- High fidelity validation: both the oracle user and evaluator achieve strong agreement with actual human users and expert judgments, confirming the realism of the simulation.
- Comprehensive empirical study: a systematic comparison of seven popular LLMs (including GPT‑4, Claude, and Llama 2) on the new benchmark, revealing concrete strengths and weaknesses.
- Open‑source release: the code, data, and evaluation scripts are publicly available, enabling the community to plug in new models and extend the benchmark.
Methodology
- Scenario construction – Domain experts authored 101 requirement‑elicitation scripts that describe a target website, its functional goals, and a set of implicit requirements (e.g., “the UI should feel modern”).
- Oracle user simulation – A rule‑based “oracle” model reads the scenario and answers any question posed by the LLM agent as a real user would, providing consistent, deterministic responses.
- Task evaluator – After a dialogue finishes, the evaluator compares the set of requirements the LLM claimed to have gathered against the ground‑truth list, computing precision, recall, and an overall “interview competence” score.
- Interaction loop – The LLM under test can ask follow‑up questions, request clarifications, or propose design ideas, just like in a real interview. The loop continues until a termination condition (e.g., max turns) is met.
- Human validation – A subset of dialogues was also run with actual users and domain experts to verify that the simulated oracle and evaluator produce comparable judgments.
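The task evaluator's scoring step can be sketched as a set comparison between claimed and ground‑truth requirements. This is a minimal illustration, not the paper's implementation: the function name, the string representation of requirements, and the use of an F1‑style harmonic mean as the "interview competence" score are all assumptions.

```python
def score_elicitation(elicited: set[str], ground_truth: set[str]) -> dict:
    """Compare requirements the agent claims to have gathered
    against the scenario's ground-truth list."""
    true_positives = len(elicited & ground_truth)
    precision = true_positives / len(elicited) if elicited else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    # The paper's exact competence formula is not reproduced here;
    # an F1-style harmonic mean serves as a plausible stand-in.
    competence = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "competence": competence}

# Example: the agent surfaced 2 of 4 hidden requirements plus one spurious item.
truth = {"upload files", "display reviews", "minimalist design", "mobile layout"}
found = {"upload files", "display reviews", "dark mode"}
scores = score_elicitation(found, truth)  # precision 2/3, recall 1/2
```

Treating requirements as exact‑match strings is a simplification; matching paraphrased requirements against ground truth is itself a nontrivial judgment the real evaluator must make.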
Results & Findings
- Overall competence is modest – Across all models, the average recall for implicit requirements hovers around 45%, meaning more than half of the hidden needs remain undiscovered.
- Late‑turn advantage – Effective elicitation questions tend to appear after the 5th turn, suggesting that LLMs need longer conversations to surface deeper requirements.
- Strengths vs. weaknesses
- Strength: LLMs are fairly good at extracting interaction (e.g., “user can upload files”) and content requirements (e.g., “display product reviews”).
- Weakness: They consistently miss style‑related requirements (e.g., “use a minimalist design”) and other nuanced non‑functional aspects.
- Model ranking – GPT‑4 achieved the highest competence score, but even it uncovered less than half of the implicit requirements, indicating a systemic gap rather than a single‑model issue.
Practical Implications
- Tooling for developers – ReqElicitGym can be integrated into CI pipelines for AI‑assistant products, automatically flagging when a new model version regresses in interview ability.
- Prompt engineering – The findings highlight the need for more sophisticated prompting strategies (e.g., “ask about aesthetic preferences early”) to improve coverage of non‑functional requirements.
- Product management – Teams building LLM‑driven requirement‑gathering bots can now benchmark against a standard, reducing reliance on costly user studies for early prototyping.
- Education & training – The dataset can serve as a teaching resource for software engineering courses that want to illustrate the challenges of eliciting hidden requirements from stakeholders.
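The CI‑integration idea above amounts to a regression gate on benchmark scores. A minimal sketch, assuming a baseline recall and tolerance chosen by the team; `BASELINE_RECALL`, `TOLERANCE`, and the `gate` helper are hypothetical, not part of ReqElicitGym's released tooling.

```python
BASELINE_RECALL = 0.45  # recall of the currently shipped model (assumed value)
TOLERANCE = 0.02        # allowed drop before the gate fails (assumed value)

def gate(new_recall: float,
         baseline: float = BASELINE_RECALL,
         tol: float = TOLERANCE) -> bool:
    """Return True if a candidate model's benchmark recall has not
    regressed more than `tol` below the baseline."""
    return new_recall >= baseline - tol

# gate(0.47) passes (improvement); gate(0.40) fails (5-point drop).
```

In practice the candidate recall would come from running the model through ReqElicitGym's evaluation scripts as part of the pipeline, with the gate's exit status failing the build on regression.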
Limitations & Future Work
- Domain scope – The benchmark focuses on website development; extending to mobile apps, enterprise systems, or embedded software may surface different challenges.
- Oracle realism – While validated against human users, the oracle still follows deterministic rules and may not capture the full variability of real stakeholder behavior (e.g., ambiguous answers, changing goals).
- Metric granularity – Current scores treat all implicit requirements equally; future work could weight functional vs. non‑functional needs or incorporate user satisfaction metrics.
- Model diversity – The study covered seven LLMs; evaluating emerging multimodal or retrieval‑augmented models could reveal new patterns.
Bottom line: ReqElicitGym fills a critical gap by giving the community a reliable playground to measure and improve the interview skills of conversational AI agents—an essential step toward truly autonomous software development pipelines.
Authors
- Dongming Jin
- Zhi Jin
- Zheng Fang
- Linyu Li
- XiaoTian Yang
- Yuanpeng He
- Xiaohong Chen
Paper Information
- arXiv ID: 2602.18306v1
- Categories: cs.SE
- Published: February 20, 2026