[Paper] ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation

Published: February 20, 2026
Source: arXiv - 2602.18306v1

Overview

The paper introduces ReqElicitGym, a sandbox‑style evaluation platform that lets researchers and engineers automatically test how well large language models (LLMs) can “interview” users to uncover software requirements. By providing a rich set of simulated user interactions and an objective scoring system, the authors make it possible to benchmark conversational requirement‑elicitation agents in a reproducible, quantitative way—something that has been missing from current practice.

Key Contributions

  • ReqElicitGym environment: an interactive, fully automated testbed that simulates real users via an "oracle user" and grades elicitation performance via a "task evaluator".
  • Large, diverse dataset: 101 end‑to‑end website‑building scenarios covering 10 different application domains (e.g., e‑commerce, blogs, dashboards).
  • High‑fidelity validation: both the oracle user and evaluator achieve strong agreement with actual human users and expert judgments, confirming the realism of the simulation.
  • Comprehensive empirical study: systematic comparison of seven popular LLMs (including GPT‑4, Claude, and Llama 2) on the new benchmark, revealing concrete strengths and weaknesses.
  • Open‑source release: the code, data, and evaluation scripts are publicly available, enabling the community to plug in new models and extend the benchmark.

Methodology

  1. Scenario construction – Domain experts authored 101 requirement‑elicitation scripts that describe a target website, its functional goals, and a set of implicit requirements (e.g., “the UI should feel modern”).
  2. Oracle user simulation – A rule‑based “oracle” model reads the scenario and answers any question posed by the LLM agent as a real user would, providing consistent, deterministic responses.
  3. Task evaluator – After a dialogue finishes, the evaluator compares the set of requirements the LLM claimed to have gathered against the ground‑truth list, computing precision, recall, and an overall “interview competence” score.
  4. Interaction loop – The LLM under test can ask follow‑up questions, request clarifications, or propose design ideas, just like in a real interview. The loop continues until a termination condition (e.g., max turns) is met.
  5. Human validation – A subset of dialogues was also run with actual users and domain experts to verify that the simulated oracle and evaluator produce comparable judgments.
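Steps 3–4 above suggest a simple shape for the evaluation harness: a turn loop between the agent and the oracle, followed by set‑based scoring. The sketch below is an illustrative assumption, not the paper's released implementation — the function names (`agent_ask`, `oracle_answer`), the exact‑match requirement comparison, and the F1‑style competence score are all simplifications.

```python
def run_interview(agent_ask, oracle_answer, max_turns=10):
    """Illustrative interaction loop (step 4): the agent under test asks
    questions, the simulated oracle user answers, until the agent signals
    it is done or the turn budget is exhausted."""
    transcript = []
    for _ in range(max_turns):
        question = agent_ask(transcript)
        if question is None:  # agent has no further questions
            break
        transcript.append((question, oracle_answer(question)))
    return transcript


def score_elicitation(elicited, ground_truth):
    """Illustrative task evaluator (step 3): precision/recall of the
    requirements the agent claims to have gathered against the scenario's
    ground truth. Assumes exact set matching for simplicity; the paper's
    evaluator judges semantic matches."""
    elicited, ground_truth = set(elicited), set(ground_truth)
    hits = elicited & ground_truth
    precision = len(hits) / len(elicited) if elicited else 0.0
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    competence = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "competence": competence}
```

In this sketch the overall "interview competence" score is simply the F1 of the two set metrics; the paper defines its own composite score.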

Results & Findings

  • Overall competence is modest – Across all models, the average recall for implicit requirements hovers around 45%, meaning more than half of the hidden needs remain undiscovered.
  • Late‑turn advantage – Effective elicitation questions tend to appear after the 5th turn, suggesting that LLMs need longer conversations to surface deeper requirements.
  • Strengths vs. weaknesses
    • Strength: LLMs are fairly good at extracting interaction (e.g., “user can upload files”) and content requirements (e.g., “display product reviews”).
    • Weakness: They consistently miss style‑related requirements (e.g., “use a minimalist design”) and other nuanced non‑functional aspects.
  • Model ranking – GPT‑4 achieved the highest competence score, but even it uncovered less than half of the implicit requirements, indicating a systemic gap rather than a single‑model issue.

Practical Implications

  • Tooling for developers – ReqElicitGym can be integrated into CI pipelines for AI‑assistant products, automatically flagging when a new model version regresses in interview ability.
  • Prompt engineering – The findings highlight the need for more sophisticated prompting strategies (e.g., “ask about aesthetic preferences early”) to improve coverage of non‑functional requirements.
  • Product management – Teams building LLM‑driven requirement‑gathering bots can now benchmark against a standard, reducing reliance on costly user studies for early prototyping.
  • Education & training – The dataset can serve as a teaching resource for software engineering courses that want to illustrate the challenges of eliciting hidden requirements from stakeholders.
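The CI idea in the first bullet can be made concrete with a small regression gate. The sketch below is a hypothetical illustration: the per‑scenario score format, scenario names, and 0.02 tolerance are assumptions, not part of the paper or its released tooling.

```python
def find_regressions(current, baseline, tolerance=0.02):
    """Compare per-scenario competence scores of a new model version
    against a recorded baseline; return scenarios whose score dropped by
    more than `tolerance`. Scenarios missing from `current` are skipped."""
    regressions = []
    for scenario, base_score in sorted(baseline.items()):
        new_score = current.get(scenario)
        if new_score is not None and base_score - new_score > tolerance:
            regressions.append((scenario, base_score, new_score))
    return regressions
```

A CI job would fail the build whenever `find_regressions` returns a non‑empty list, flagging the model update for review.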

Limitations & Future Work

  • Domain scope – The benchmark focuses on website development; extending to mobile apps, enterprise systems, or embedded software may surface different challenges.
  • Oracle realism – While validated against human users, the oracle still follows deterministic rules and may not capture the full variability of real stakeholder behavior (e.g., ambiguous answers, changing goals).
  • Metric granularity – Current scores treat all implicit requirements equally; future work could weight functional vs. non‑functional needs or incorporate user satisfaction metrics.
  • Model diversity – The study covered seven LLMs; evaluating emerging multimodal or retrieval‑augmented models could reveal new patterns.

Bottom line: ReqElicitGym fills a critical gap by giving the community a reliable playground to measure and improve the interview skills of conversational AI agents—an essential step toward truly autonomous software development pipelines.

Authors

  • Dongming Jin
  • Zhi Jin
  • Zheng Fang
  • Linyu Li
  • XiaoTian Yang
  • Yuanpeng He
  • Xiaohong Chen

Paper Information

  • arXiv ID: 2602.18306v1
  • Categories: cs.SE
  • Published: February 20, 2026