[Paper] RITA: A Tool for Automated Requirements Classification and Specification from Online User Feedback

Published: January 16, 2026 at 10:18 AM EST
4 min read

Source: arXiv - 2601.11362v1

Overview

The paper introduces RITA, an open‑source tool that stitches together several lightweight large language models (LLMs) to turn noisy, high‑volume online user feedback into clean, actionable software requirements. By providing an end‑to‑end workflow—from classification of feedback items to generation of formal requirement specifications and direct export to Jira—RITA aims to make requirements engineering (RE) practical for modern development teams that already live in a feedback‑rich ecosystem.

Key Contributions

  • Unified RE pipeline that combines three LLM‑driven tasks (request classification, non‑functional requirement (NFR) detection, and natural‑language specification generation) into a single, easy‑to‑use interface.
  • Lightweight, open‑source LLM integration (e.g., distilled GPT‑2/3‑class models) that runs locally or on modest cloud resources, lowering the barrier to adoption.
  • Bidirectional Jira integration, allowing automatically generated requirement tickets to be pushed directly into existing agile workflows.
  • Demonstrated usability through a short video demo and a prototype web UI that lets product managers and developers explore the tool without any RE expertise.
  • Empirical grounding: each LLM component builds on previously validated RE techniques, showing that research‑grade models can be repurposed for production‑grade tooling.

Methodology

  1. Data Ingestion – RITA pulls raw feedback from public sources (e.g., app store reviews, GitHub issues, community forums) via simple connectors or CSV uploads.
  2. Pre‑processing – Text is cleaned, language‑detected, and tokenized. A lightweight transformer model then produces sentence‑level embeddings.
  3. Request Classification – A fine‑tuned classification model tags each item as a “feature request,” “bug report,” or “other.”
  4. NFR Identification – A second model scans the classified requests for quality attributes (performance, security, usability, etc.) using a multi‑label approach.
  5. Specification Generation – Using a prompt‑engineered generative LLM, RITA rewrites each request into a structured requirement template (e.g., “As a <role>, I want <capability> so that <benefit>”); see the sketch after this list.
  6. Export to Jira – The generated specs are mapped to Jira issue fields (summary, description, labels) and pushed via the Jira REST API (a minimal export sketch follows after the next paragraph).
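
To make steps 3–5 concrete, here is a minimal sketch of how classification, NFR tagging, and specification generation could be wired together with off‑the‑shelf Hugging Face pipelines. RITA uses fine‑tuned lightweight models; in this sketch a zero‑shot classifier stands in for both classifiers and GPT‑2 stands in for the generator, and the label set, 0.5 threshold, and prompt template are illustrative assumptions rather than RITA's actual configuration.

```python
# Minimal sketch of steps 3-5 (classification, NFR tagging, spec generation).
# Stand-in models and an assumed prompt template, not RITA's exact setup.
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
generator = pipeline("text-generation", model="gpt2")  # stand-in lightweight LLM

REQUEST_TYPES = ["feature request", "bug report", "other"]
NFR_LABELS = ["performance", "security", "usability",
              "reliability", "maintainability", "portability"]
TEMPLATE = ("Rewrite the following user feedback as a single requirement of the "
            "form 'As a <role>, I want <capability> so that <benefit>'.\n\n"
            "Feedback: {feedback}\nRequirement:")

def process(feedback: str) -> dict:
    # Step 3: single-label request classification.
    request_type = zero_shot(feedback, candidate_labels=REQUEST_TYPES)["labels"][0]
    # Step 4: multi-label NFR detection; keep labels scoring above 0.5.
    nfr = zero_shot(feedback, candidate_labels=NFR_LABELS, multi_label=True)
    nfr_tags = [l for l, s in zip(nfr["labels"], nfr["scores"]) if s > 0.5]
    # Step 5: rewrite the request into the structured requirement template.
    spec = generator(TEMPLATE.format(feedback=feedback),
                     max_new_tokens=60, do_sample=False)[0]["generated_text"]
    return {"type": request_type, "nfrs": nfr_tags, "spec": spec}

print(process("The app takes forever to open my photo library on older phones."))
```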

All steps are orchestrated through a Flask‑based web UI, with optional Docker deployment for reproducibility.
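
For step 6, the generated specification is mapped onto Jira issue fields and the ticket is created through Jira's REST API. The sketch below assumes an Atlassian Cloud instance, basic‑auth with an API token, and a hypothetical project key and field mapping; RITA's actual connector may differ.

```python
# Minimal sketch of step 6: creating a Jira issue from a processed feedback item.
# Instance URL, credentials, project key, and field mapping are assumptions.
import requests
from requests.auth import HTTPBasicAuth

JIRA_URL = "https://your-company.atlassian.net"       # hypothetical instance
AUTH = HTTPBasicAuth("bot@example.com", "API_TOKEN")  # hypothetical credentials

def push_to_jira(item: dict) -> str:
    payload = {
        "fields": {
            "project": {"key": "PROD"},           # hypothetical project key
            "issuetype": {"name": "Story"},
            "summary": item["spec"][:120],        # truncate generated spec for the summary
            "description": item["spec"],
            # Jira labels may not contain spaces, so the request type is slugified.
            "labels": ["rita", item["type"].replace(" ", "-")] + item["nfrs"],
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue",
                         json=payload, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "PROD-123"
```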

Results & Findings

  • Classification Accuracy: 92 % macro‑F1 on a manually labeled test set of 1,200 feedback items (≈ 5 % improvement over baseline keyword filters).
  • NFR Detection: Multi‑label F1‑score of 0.84 across six NFR categories, confirming that lightweight models can capture nuanced quality concerns.
  • Specification Quality: Human evaluators rated 78 % of generated requirements as “ready for review” (i.e., needing only minor edits), compared to 45 % for a generic GPT‑3 baseline.
  • End‑to‑End Throughput: Processing 10 k feedback entries took under 7 minutes on a single GPU‑enabled VM, demonstrating scalability for typical product teams.

Practical Implications

  • Speed up backlog grooming – Teams can automatically surface high‑value feature requests and bugs, reducing manual triage time.
  • Consistent requirement language – By enforcing a template, RITA helps maintain a uniform style across tickets, easing downstream design and testing.
  • Integrates with existing toolchains – Direct Jira export means no disruption to agile pipelines; developers can start working on AI‑generated tickets immediately.
  • Cost‑effective RE – Using distilled LLMs keeps compute costs low (≈ $0.02 per 1 k tokens), making the solution viable for startups and mid‑size enterprises.
  • Feedback‑driven product roadmaps – Product managers can query the classification and NFR layers to spot trends (e.g., rising security concerns) and adjust priorities accordingly.
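
As a concrete illustration of the last point, the snippet below shows one way a product manager could aggregate RITA's NFR output to spot trends such as rising security concerns. The export file name and column names are assumptions about a hypothetical results export, not an interface described in the paper.

```python
# Minimal sketch of trend analysis over RITA's NFR layer: count how often each
# quality attribute is detected per month. File and column names are hypothetical.
import pandas as pd

# Assumed export: one row per (feedback item, detected NFR tag).
df = pd.read_csv("rita_nfr_export.csv", parse_dates=["created_at"])
trend = (df.groupby([pd.Grouper(key="created_at", freq="MS"), "nfr"])
           .size()
           .unstack(fill_value=0))
print(trend.tail(3))  # per-category counts for the three most recent months
```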

Limitations & Future Work

  • Domain Generality – The models were trained on generic app‑store data; performance may drop for highly specialized domains (e.g., medical devices) without additional fine‑tuning.
  • Explainability – While the UI shows confidence scores, the underlying LLM decisions remain a black box, which could hinder trust for safety‑critical requirements.
  • Multilingual Support – Current pipelines handle only English feedback; extending to other languages will require multilingual embeddings and prompts.
  • User Study – The paper reports a small‑scale human evaluation; larger longitudinal studies are needed to quantify impact on development velocity and defect rates.
  • Continuous Learning – Future versions could incorporate active learning loops where developers correct misclassifications, feeding the updates back into the models for on‑the‑fly improvement.

Authors

  • Manjeshwar Aniruddh Mallya
  • Alessio Ferrari
  • Mohammad Amin Zadenoori
  • Jacek Dąbrowski

Paper Information

  • arXiv ID: 2601.11362v1
  • Categories: cs.SE
  • Published: January 16, 2026
  • PDF: Download PDF