[Paper] A Dataset of Low-Rated Applications from the Amazon Appstore for User Feedback Analysis

Published: January 6, 2026 at 08:32 AM EST
3 min read
Source: arXiv - 2601.03009v1

Overview

The authors present a new, publicly available dataset focused on low‑rated Android apps from the Amazon Appstore. By harvesting roughly 80,000 user reviews and manually annotating 6,000 of them into six concrete issue categories, the work shines a light on the “negative” side of app feedback: information that is often ignored yet rich with clues for fixing bugs, improving UX, and recovering ratings.

Key Contributions

  • First large‑scale low‑rating dataset for Android apps (64 apps, 79,821 reviews).
  • Manual annotation of 6,000 reviews into six well‑defined issue types: UI/UX, functionality, compatibility, performance/stability, support, and security/privacy.
  • Open‑source release of both raw and annotated data, enabling reproducibility and downstream research.
  • Baseline analysis of issue frequency and distribution, establishing a reference point for future studies.
  • Framework for automated feedback classification, paving the way for ML models that can triage negative reviews at scale.

Methodology

  1. App selection – Queried the Amazon Appstore for apps with an average rating ≤ 2.5 stars, resulting in 64 distinct applications across various categories (games, utilities, etc.).
  2. Review collection – Scraped all available user reviews via the store’s public API, yielding 79,821 textual entries.
  3. Issue taxonomy design – Built on prior work to define six high‑level issue categories that capture the most common pain points in low‑rated apps.
  4. Manual annotation – Six thousand reviews were independently labeled by domain experts; inter‑annotator agreement (Cohen’s κ ≈ 0.78) indicates a reliable ground truth.
  5. Dataset packaging – Released the raw JSON dump and a CSV file containing the annotated subset (review text, app ID, rating, issue label) under a permissive license.
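
To make the packaging concrete, here is a minimal sketch of loading and summarizing the annotated subset with pandas. The file name and column names (`review_text`, `app_id`, `rating`, `issue_label`) are assumptions for illustration; the released CSV may use a different schema.

```python
import pandas as pd

# Assumed file and column names; the released CSV's actual schema may differ.
df = pd.read_csv("amazon_appstore_low_rated_annotated.csv")

# Per the packaging description, each row should carry the review text,
# the app ID, the star rating, and one of the six issue labels.
print(df.columns.tolist())

# Share of each issue category across the ~6,000 annotated reviews.
print(df["issue_label"].value_counts(normalize=True).round(3))

# Mean star rating per issue category (all apps are low-rated overall).
print(df.groupby("issue_label")["rating"].mean().round(2))
```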

Results & Findings

  • Issue distribution: Performance/stability (≈ 28 %) and UI/UX (≈ 24 %) were most prevalent, followed by functionality (≈ 18 %). Security/privacy and support issues were rarer but present.
  • Review length & sentiment: Low‑rated reviews tended to be shorter and more emotionally charged (higher incidence of sarcasm and negative sentiment) compared to high‑rated counterparts reported in prior studies.
  • Cross‑app patterns: Certain issue types (e.g., crashes on specific device models) recurred across multiple apps, suggesting systemic compatibility challenges in the Amazon ecosystem.
  • Baseline classification: A simple TF‑IDF + Logistic Regression model achieved ~71 % accuracy on the 6,000 annotated reviews, confirming that the taxonomy is learnable and that the dataset can serve as a benchmark for more sophisticated deep‑learning approaches.
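
As a rough illustration of that baseline, the sketch below builds a TF‑IDF + Logistic Regression pipeline with scikit-learn. The column names and the 80/20 split are assumptions; the paper’s exact preprocessing and evaluation protocol may differ, so this is not a faithful reproduction of the reported ~71 %.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumed schema: 'review_text' and 'issue_label' columns in the annotated CSV.
df = pd.read_csv("amazon_appstore_low_rated_annotated.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["issue_label"],
    test_size=0.2, stratify=df["issue_label"], random_state=42,
)

# Word uni/bigram TF-IDF features feeding a linear classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", round(accuracy_score(y_test, pred), 3))
print(classification_report(y_test, pred))
```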

Practical Implications

  • Automated triage pipelines – Developers can integrate a trained classifier into CI/CD or release‑monitoring tools to flag incoming negative reviews and route them to the appropriate engineering team (UI, backend, security, etc.); a routing sketch follows this list.
  • Prioritization of bug fixes – By quantifying the share of performance vs. UI complaints, product managers can allocate resources where they’ll have the biggest impact on rating recovery.
  • Competitive intelligence – Vendors can benchmark their own low‑rated apps against the dataset to identify common failure modes, informing cross‑app remediation strategies.
  • Enhanced app‑store moderation – Store operators (Amazon, Google Play) could use the dataset to train moderation bots that detect abusive language, sarcasm, or privacy‑related allegations, improving user trust.
  • Research acceleration – The open dataset lowers the entry barrier for work on sentiment analysis, sarcasm detection, and software evolution studies that specifically target the “negative feedback” niche.
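
As a sketch of the triage idea above: given any fitted text classifier over the six categories (for example, the pipeline from the previous sketch), routing reduces to a label‑to‑team lookup. The label strings and team names below are purely illustrative assumptions, not taken from the paper.

```python
from typing import Dict

# Hypothetical label-to-team mapping; both the label strings and the team
# names are illustrative assumptions, not defined by the paper.
ISSUE_TO_TEAM: Dict[str, str] = {
    "ui_ux": "design",
    "functionality": "feature-team",
    "compatibility": "platform",
    "performance_stability": "performance",
    "support": "customer-support",
    "security_privacy": "security",
}

def route_review(review_text: str, classifier) -> str:
    """Predict an issue category for an incoming review and return the team
    that should own it. `classifier` is any fitted model exposing a
    scikit-learn-style predict() over raw text."""
    label = classifier.predict([review_text])[0]
    return ISSUE_TO_TEAM.get(label, "triage-backlog")

# Example (assuming `clf` is the pipeline trained in the previous sketch):
# route_review("Crashes every time I open it on my Fire tablet", clf)
```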

Limitations & Future Work

  • Platform scope – Limited to the Amazon Appstore; Android apps from Google Play may exhibit different review patterns.
  • Temporal bias – Reviews were collected at a single point in time; app updates could shift issue distributions, so longitudinal studies are needed.
  • Annotation granularity – Six coarse categories capture major themes but may miss finer‑grained nuances (e.g., network latency vs. battery drain). Future work could expand the taxonomy or adopt hierarchical labeling.
  • Model baselines – Only simple classifiers were evaluated; exploring transformer‑based models, multimodal inputs (ratings, timestamps), and transfer learning could boost classification performance.

Authors

  • Nek Dil Khan
  • Javed Ali Khan
  • Darvesh Khan
  • Jianqiang Li
  • Mumrez Khan
  • Shah Fahad Khan

Paper Information

  • arXiv ID: 2601.03009v1
  • Categories: cs.SE
  • Published: January 6, 2026