[Paper] SimpleDevQA: Benchmarking Large Language Models on Development Knowledge QA
Source: arXiv - 2512.08867v1
Overview
The paper introduces SimpleDevQA, a new multilingual benchmark that evaluates how well large language models (LLMs) can answer development‑knowledge questions—those everyday queries developers ask that go beyond just writing code. By mining real‑world chat logs, the authors show that nearly 40 % of developer‑LLM interactions are knowledge‑seeking, yet existing benchmarks barely cover this space.
Key Contributions
- Real‑world insight: Analysis of the WildChat corpus reveals that development‑knowledge QA dominates developer‑LLM conversations, far out‑numbering pure code‑generation requests.
- Benchmark gap identification: Existing QA suites focus almost exclusively on code understanding and are often built from synthetic or curated queries, missing the broader knowledge needs of developers.
- SimpleDevQA pipeline: A three‑phase method (dialogue filtering → QA pair extraction → answer verification) that converts authentic multi‑turn chats into clean, short, verifiable QA pairs.
- Multilingual dataset: 2,740 QA pairs spanning English, Chinese, and Russian, each with a single, unambiguous answer.
- Empirical findings:
  - Code‑specialized LLMs beat general‑purpose LLMs of comparable size.
  - Retrieval‑augmented generation (RAG) lifts accuracy by an average of 11.3 %.
  - LLMs tend to be over‑confident overall, yet higher self‑reported confidence still correlates with higher accuracy.
  - Strong code‑generation ability predicts stronger performance on development‑knowledge QA.
Methodology
- Data collection: The authors harvested multi‑turn dialogues between developers and LLMs from the public WildChat logs.
- Phase 1 – Dialogue filtering: They removed non‑knowledge‑seeking turns (e.g., pure code generation, chit‑chat) and kept only exchanges where the user asked for factual or conceptual information.
- Phase 2 – QA pair extraction: Each filtered exchange was distilled into a concise question and a short, verifiable answer. Ambiguous or multi‑sentence answers were discarded.
- Phase 3 – Answer verification: Automated checks (e.g., exact‑match against reference sources) and manual review ensured that each answer is correct and uniquely defined.
- Benchmark construction: The final set was split into English, Chinese, and Russian subsets, preserving the natural distribution of topics (API usage, debugging strategies, best‑practice guidelines, etc.).
The pipeline is deliberately lightweight—hence “SimpleDevQA”—so that the resulting benchmark can be used for fast, reproducible evaluation without heavy annotation overhead.
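To make the three phases concrete, here is a minimal Python sketch of what a filter → extract → verify flow could look like. The helper names, the keyword heuristics, and the single‑sentence answer rule are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a filter -> extract -> verify pipeline in the spirit of
# SimpleDevQA. All helper names and heuristics below are illustrative
# assumptions, not the paper's actual implementation.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    language: str

CODE_MARKERS = ("```", "def ", "class ", "#include", "import ")

def is_knowledge_seeking(user_turn: str) -> bool:
    """Phase 1: keep turns asking for facts/concepts, drop code-gen requests."""
    asks_question = user_turn.strip().endswith("?") or user_turn.lower().startswith(
        ("what", "why", "how", "when", "which")
    )
    wants_code = any(marker in user_turn for marker in CODE_MARKERS)
    return asks_question and not wants_code

def extract_qa(user_turn: str, assistant_turn: str) -> QAPair | None:
    """Phase 2: distill the exchange into a short, single-answer QA pair.
    Here we simply keep short one-sentence answers."""
    answer = assistant_turn.strip()
    if answer.count(".") > 1 or len(answer.split()) > 30:
        return None  # discard ambiguous / multi-sentence answers
    return QAPair(question=user_turn.strip(), answer=answer, language="en")

def verify(pair: QAPair, reference_answers: dict[str, str]) -> bool:
    """Phase 3: exact-match check against a reference source (the paper also
    applies manual review, not shown here)."""
    expected = reference_answers.get(pair.question)
    return expected is not None and expected.strip().lower() == pair.answer.lower()

def build_benchmark(dialogues, reference_answers):
    """Run all three phases over (user_turn, assistant_turn) pairs."""
    pairs = []
    for user_turn, assistant_turn in dialogues:
        if not is_knowledge_seeking(user_turn):
            continue
        pair = extract_qa(user_turn, assistant_turn)
        if pair and verify(pair, reference_answers):
            pairs.append(pair)
    return pairs
```

In practice, the extraction and verification steps would rely on more careful automated checks plus the manual review described above, rather than the toy heuristics shown here.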
Results & Findings
| Model type | Accuracy (no RAG) | With RAG (gain → accuracy) | Observations |
|---|---|---|---|
| General‑purpose LLM (≈13B) | 42.1 % | +11.3 % → 53.4 % | Gains come from pulling up‑to‑date docs and StackOverflow snippets. |
| Code‑focused LLM (≈13B) | 48.7 % | +9.8 % → 58.5 % | Still ahead of general models even before retrieval. |
| Larger code LLM (≈34B) | 55.2 % | +10.1 % → 65.3 % | Scaling improves both code and knowledge QA. |
- Overconfidence: Models often assign high probability to wrong answers; calibration techniques are needed before deploying them in production.
- Confidence‑accuracy correlation: When a model’s self‑estimated confidence exceeds 80 %, its answer is correct about 70 % of the time, suggesting confidence can be used as a gating signal (see the gating sketch after this list).
- Cross‑language consistency: Performance gaps between English and the other two languages are modest (≈5 % lower for Chinese/Russian), indicating the benchmark’s multilingual design is effective.
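To illustrate the gating idea, the sketch below buckets answers by self‑reported confidence and auto‑accepts only those above a threshold. The record format and the 80 % threshold are assumptions chosen to mirror the reported numbers, not part of the benchmark itself.

```python
# Illustrative sketch of using self-reported confidence as a gating signal,
# motivated by the benchmark's calibration findings. The record format and
# the 80% threshold are assumptions for illustration only.
from collections import defaultdict

def accuracy_by_confidence(records):
    """records: iterable of (confidence in [0, 1], is_correct bool).
    Returns accuracy per 10%-wide confidence bucket, for inspecting calibration."""
    buckets = defaultdict(lambda: [0, 0])  # decile index -> [correct, total]
    for confidence, is_correct in records:
        decile = min(int(confidence * 10), 9)  # 0.0-0.1 -> 0, ..., 0.9-1.0 -> 9
        buckets[decile][0] += int(is_correct)
        buckets[decile][1] += 1
    return {d / 10: correct / total for d, (correct, total) in sorted(buckets.items())}

def gate_answer(answer: str, confidence: float, threshold: float = 0.8):
    """Auto-accept high-confidence answers; route the rest to human review.
    Per the reported numbers, even >80% confidence is only ~70% accurate,
    so the gate reduces but does not remove the need for verification."""
    if confidence >= threshold:
        return answer, "auto"
    return answer, "needs-review"
```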
Practical Implications
- Better IDE assistants: By training or fine‑tuning on SimpleDevQA, code‑completion tools can answer “Why does this API throw X?” or “What’s the recommended pattern for X?” without needing a separate knowledge base.
- Improved chat‑ops bots: Customer‑support or internal dev‑ops bots can leverage RAG pipelines to retrieve up‑to‑date documentation, reducing reliance on brittle rule‑based answers (a minimal RAG sketch follows this list).
- Confidence‑aware UI: UI designers can surface a model’s confidence score to developers, prompting them to verify answers when confidence is low, mitigating the overconfidence risk.
- Multilingual support: Companies with globally distributed dev teams can adopt a single model that handles English, Chinese, and Russian queries, simplifying maintenance.
- Benchmark‑driven model selection: Organizations can benchmark their internal LLMs on SimpleDevQA to gauge readiness for real‑world dev‑support tasks before rollout.
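As a rough illustration of the RAG setup mentioned for chat‑ops bots, the sketch below retrieves the most relevant documentation snippets before prompting the model. The keyword‑overlap retriever, corpus format, and `generate` callable are hypothetical stand‑ins; a production system would use an embedding index and a real LLM client.

```python
# Minimal retrieval-augmented answering sketch. The corpus format, the
# keyword-overlap retriever, and the `generate` callable are hypothetical
# stand-ins, not a prescribed implementation.
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank documentation snippets by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer_with_rag(question: str, corpus: list[str], generate) -> str:
    """Prepend retrieved context to the question before calling the model.
    `generate` is any callable that maps a prompt string to a model answer."""
    context = "\n".join(retrieve(question, corpus))
    prompt = (
        "Answer the developer question using the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)

# Example usage with a placeholder model:
if __name__ == "__main__":
    docs = [
        "requests.get raises requests.exceptions.Timeout when the timeout expires.",
        "The GIL prevents multiple native threads from executing Python bytecode at once.",
    ]

    def echo_model(prompt: str) -> str:
        return "(model answer would appear here)"

    print(answer_with_rag("Why does requests.get raise Timeout?", docs, echo_model))
```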
Limitations & Future Work
- Scope of knowledge: The benchmark emphasizes short, factual answers; more open‑ended, design‑oriented questions (e.g., “How should I architect X?”) remain uncovered.
- Dataset size: At 2.7 k pairs, SimpleDevQA is modest compared to massive code‑generation corpora; scaling up could reveal additional failure modes.
- Dynamic knowledge: Answers are static snapshots; future work could integrate time‑aware retrieval to handle evolving APIs and libraries.
- User intent modeling: The pipeline currently treats each filtered turn as a standalone QA pair; richer context handling (multi‑turn reasoning) is an open research direction.
By addressing these gaps, the community can move toward LLMs that not only write code but also serve as reliable, multilingual development knowledge partners.
Authors
- Jing Zhang
- Lianghong Guo
- Yanlin Wang
- Mingwei Liu
- Jiachi Chen
- Yuchi Ma
- Ensheng Shi
- Terry Yue Zhuo
- Hongyu Zhang
- Zibin Zheng
Paper Information
- arXiv ID: 2512.08867v1
- Categories: cs.SE
- Published: December 9, 2025