[Paper] Question Answering for Multi-Release Systems: A Case Study at Ciena

Published: January 5, 2026 at 01:44 PM EST
4 min read
Source: arXiv - 2601.02345v1

Overview

The paper tackles a real‑world pain point for software vendors and large enterprises: answering developer or operator questions when multiple versions of a product are in the field at the same time. Traditional retrieval‑augmented generation (RAG) chatbots stumble on “multi‑release” documentation because the texts for different releases are almost identical yet contain subtle, version‑specific differences. The authors introduce QAMR, a chatbot that adapts RAG to reliably surface the right answer for the right release, and they validate it on both a public benchmark and a proprietary Ciena dataset.

Key Contributions

  • QAMR architecture that extends standard RAG with pre‑processing, query rewriting, and smart context selection to disambiguate overlapping release docs.
  • Dual‑chunking strategy: separate chunk sizes for the retrieval stage and the generation stage, allowing each to be tuned independently for optimal performance.
  • Empirical validation on a public SE benchmark and a large, real‑world multi‑release corpus from Ciena, showing substantial gains over a strong baseline.
  • Comprehensive ablation study demonstrating the individual impact of each QAMR component on answer correctness and retrieval accuracy.
  • Correlation analysis confirming that automatically computed metrics align closely with expert human judgments, supporting the reliability of the evaluation pipeline.

Methodology

  1. Document Pre‑processing – The raw multi‑release manuals are first normalized (e.g., version tags stripped, duplicate sections collapsed) to reduce noise while preserving version‑specific cues (steps 1–5 are each sketched in code after this list).
  2. Query Rewriting – When a user asks a question, a lightweight classifier detects whether the query mentions a release (explicitly or implicitly) and rewrites it to include the appropriate version identifier.
  3. Context Selection – Instead of feeding the entire retrieved passage to the generator, QAMR selects a release‑focused subset using a similarity‑aware ranker that penalizes cross‑release overlap.
  4. Dual‑Chunking – Retrieval operates on relatively large chunks (≈300‑500 words) to capture enough context for accurate matching, while the generation model receives smaller, fine‑grained chunks (≈100‑150 words) to keep the prompt concise and reduce hallucination.
  5. Answer Generation – A standard large language model (LLM) is prompted with the rewritten query and the selected generation chunk, producing the final answer.
  6. Evaluation – Accuracy is measured both at the retrieval level (did the system fetch the correct release doc?) and at the answer level (was the answer factually correct?). Human experts also rated a sample to verify metric validity.
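
The sketches below illustrate how steps 1–5 could be implemented; they are minimal sketches under stated assumptions, not the authors' code. For step 1, assuming plain‑text manuals with per‑page release banners, a normalizer might strip the banners and drop sections that repeat verbatim across releases; the regex, helper names, and section granularity are illustrative.

```python
import hashlib
import re

# Hypothetical normalizer for step 1: strip per-page release banners and keep
# only one copy of sections that repeat verbatim across releases. The regex,
# names, and section granularity are illustrative assumptions.
RELEASE_BANNER = re.compile(r"^\s*Release\s+\d+(?:\.\d+)*\s*$", re.MULTILINE)

def normalize_manual(text: str) -> str:
    """Remove noisy release banners while leaving in-line version cues intact."""
    return RELEASE_BANNER.sub("", text)

def collapse_duplicates(sections: list[str]) -> list[str]:
    """Drop sections identical (after trimming) to one already seen."""
    seen: set[str] = set()
    unique: list[str] = []
    for section in sections:
        digest = hashlib.sha1(section.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(section)
    return unique
```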
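
Step 2 can be approximated with a simple pattern match plus a per‑session default release. The paper describes a lightweight classifier, so the detection logic below is an assumption for illustration only.

```python
import re
from typing import Optional

# Illustrative query rewriter for step 2: make the target release explicit so
# retrieval can filter on it. The pattern and fallback policy are assumptions.
RELEASE_PATTERN = re.compile(r"\b(?:release|version|v)\s*(\d+(?:\.\d+)*)\b", re.IGNORECASE)

def rewrite_query(query: str, default_release: Optional[str] = None) -> str:
    """Append a release identifier when the query does not mention one."""
    match = RELEASE_PATTERN.search(query)
    release = match.group(1) if match else default_release
    if release is None:
        return query  # no cue available; fall back to release-agnostic retrieval
    if match is None:
        return f"{query} (release {release})"
    return query  # release already explicit; nothing to rewrite

# Example: a query with no explicit cue inherits the user's active release.
print(rewrite_query("How do I enable feature X?", default_release="7.3"))
# -> "How do I enable feature X? (release 7.3)"
```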
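
For step 3, one plausible reading of the similarity‑aware ranker is to score each candidate passage by its query similarity minus a penalty for near‑duplicate text retrieved from other releases. The scoring formula and penalty weight below are assumptions, not the paper's ranker.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    release: str
    text: str
    query_similarity: float  # e.g. cosine similarity reported by the retriever

def jaccard(a: str, b: str) -> float:
    """Crude token-overlap measure used here as a stand-in for overlap scoring."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_context(passages: list[Passage], target_release: str,
                   k: int = 3, penalty: float = 0.5) -> list[Passage]:
    """Keep the top-k passages for the target release, demoting passages that
    mirror other releases' text (their version-specific details may differ)."""
    other = [p for p in passages if p.release != target_release]
    scored = []
    for p in passages:
        if p.release != target_release:
            continue
        overlap = max((jaccard(p.text, o.text) for o in other), default=0.0)
        scored.append((p.query_similarity - penalty * overlap, p))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]
```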
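
Step 4's dual‑chunking can be sketched by cutting the same document twice, with word counts matching the ranges quoted above; the overlap values are illustrative assumptions.

```python
# Minimal sketch of step 4: the same document is chunked twice, coarsely for
# retrieval and finely for the generation prompt.

def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into word-count chunks with a fixed overlap between them."""
    words = text.split()
    step = max(size - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def dual_chunk(document: str) -> tuple[list[str], list[str]]:
    retrieval_chunks = chunk_words(document, size=400, overlap=50)   # ~300-500 words
    generation_chunks = chunk_words(document, size=120, overlap=20)  # ~100-150 words
    return retrieval_chunks, generation_chunks
```

At query time, a matched retrieval chunk would be mapped back to the fine‑grained chunks it spans, so the retriever sees enough context to disambiguate releases while the generator's prompt stays short.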
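
Finally, step 5 amounts to assembling a release‑scoped prompt from the rewritten query and the selected generation chunks. The template wording and the omitted model call are assumptions, since the paper does not prescribe a specific LLM or prompt.

```python
# Illustrative prompt assembly for step 5; the template text is an assumption.
PROMPT_TEMPLATE = """You are answering questions about release {release} only.
Use the context below; if it does not contain the answer, say so.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, release: str, generation_chunks: list[str]) -> str:
    """Combine the rewritten query and selected chunks into a single prompt."""
    context = "\n\n".join(generation_chunks)
    return PROMPT_TEMPLATE.format(release=release, context=context, question=question)

# The resulting string is sent to whatever LLM endpoint the deployment uses;
# that call is intentionally omitted here.
```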

Results & Findings

  Metric                         Baseline RAG   QAMR
  Answer correctness (average)   72.0 %         88.5 % (+16.5 pp)
  Retrieval accuracy (average)   78 %           90 % (+12 pp)
  Response time (average)        1.20 s         1.10 s (-8 %)

  • Ablation impact: Removing query rewriting dropped answer correctness by ~7 pp; disabling dual‑chunking reduced retrieval accuracy by ~5 pp. The best single‑component variant still lagged the full QAMR by ~19.6 % (answer) and ~14.0 % (retrieval).
  • Human vs. automatic scores: Pearson correlation > 0.92, indicating the automated metrics are trustworthy proxies for expert evaluation (a toy version of this check is sketched below).
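
As a toy illustration of that validation step, Pearson's r between automatic and human scores can be computed directly with the standard library. The scores below are made up; the reported r > 0.92 comes from the paper's own annotated sample.

```python
from statistics import correlation  # Pearson correlation (Python 3.10+)

# Hypothetical automatic vs. expert scores for the same sampled answers.
automatic_scores = [0.91, 0.40, 0.78, 0.66, 0.95, 0.52]
human_scores     = [0.90, 0.35, 0.80, 0.70, 1.00, 0.50]

r = correlation(automatic_scores, human_scores)
print(f"Pearson r = {r:.2f}")  # values near 1.0 justify trusting the automatic metric
```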

Practical Implications

  • Reduced support overhead: Companies can deploy QAMR‑powered assistants to field version‑specific queries from engineers, field technicians, or customers without maintaining separate bots per release.
  • Faster onboarding: New hires can ask “How do I configure feature X in release 7.3?” and receive precise guidance, cutting down documentation search time.
  • Improved CI/CD tooling: Integration with internal ticketing or chat platforms (e.g., Slack, Teams) enables automated “release‑aware” troubleshooting bots that fetch the right config snippets or migration steps.
  • Scalable knowledge management: The dual‑chunking approach lets organizations keep a single, unified documentation repository while still delivering accurate, release‑targeted answers.
  • Potential for other domains: Any product line with overlapping manuals—hardware firmware, API versions, regulatory compliance guides—can benefit from the same pipeline.

Limitations & Future Work

  • Dependence on explicit version cues: QAMR performs best when the query or the document contains clear release identifiers; ambiguous phrasing can still lead to mis‑selection.
  • Manual tuning of chunk sizes: The optimal retrieval and generation chunk lengths were empirically chosen for the Ciena dataset; automated tuning or adaptive chunking could improve portability.
  • LLM hallucination risk: Although the dual‑chunking reduces hallucinations, the underlying generator can still produce plausible‑but‑incorrect statements if the retrieved context is noisy.
  • Evaluation scope: The study focuses on a single industry partner; broader validation across diverse software stacks (e.g., open‑source libraries, cloud services) would strengthen generalizability.
  • Future directions: The authors suggest exploring end‑to‑end trainable retrieval‑generation models that learn release disambiguation jointly, and incorporating user feedback loops to continuously refine the query‑rewriting component.

Authors

  • Parham Khamsepour
  • Mark Cole
  • Ish Ashraf
  • Sandeep Puri
  • Mehrdad Sabetzadeh
  • Shiva Nejati

Paper Information

  • arXiv ID: 2601.02345v1
  • Categories: cs.SE
  • Published: January 5, 2026