[Paper] A Dataset of Low-Rated Applications from the Amazon Appstore for User Feedback Analysis

Published: January 6, 2026 at 08:32 AM EST
3 min read
Source: arXiv - 2601.03009v1

Overview

The authors present a new, publicly available dataset focused on low‑rated Android apps from the Amazon Appstore. By harvesting roughly 80,000 user reviews and manually annotating 6,000 of them into six concrete issue categories, the work shines a light on the “negative” side of app feedback: information that is often ignored yet rich with clues for fixing bugs, improving UX, and recovering ratings.

Key Contributions

  • First large‑scale low‑rating dataset for Android apps (64 apps, 79,821 reviews).
  • Manual annotation of 6,000 reviews into six well‑defined issue types: UI/UX, functionality, compatibility, performance/stability, support, and security/privacy.
  • Open‑source release of both raw and annotated data, enabling reproducibility and downstream research.
  • Baseline analysis of issue frequency and distribution, establishing a reference point for future studies.
  • Framework for automated feedback classification, paving the way for ML models that can triage negative reviews at scale.

Methodology

  1. App selection – Queried the Amazon Appstore for apps with an average rating ≤ 2.5 stars, resulting in 64 distinct applications across various categories (games, utilities, etc.).
  2. Review collection – Scraped all available user reviews via the store’s public API, yielding 79,821 textual entries.
  3. Issue taxonomy design – Built on prior work to define six high‑level issue categories that capture the most common pain points in low‑rated apps.
  4. Manual annotation – Six thousand reviews were independently labeled by domain experts; inter‑annotator agreement (Cohen’s κ ≈ 0.78) indicates a reliable ground truth.
  5. Dataset packaging – Released the raw JSON dump and a CSV file containing the annotated subset (review text, app ID, rating, issue label) under a permissive license.
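
To make the packaging concrete, here is a minimal sketch of loading and summarizing the annotated subset with pandas. The file name and column names (`review_text`, `app_id`, `rating`, `issue_label`) are assumptions for illustration; the released CSV may use a different schema.

```python
import pandas as pd

# Assumed file and column names; the released CSV's actual schema may differ.
df = pd.read_csv("amazon_appstore_low_rated_annotated.csv")

# Per the packaging description, each row should carry the review text,
# the app ID, the star rating, and one of the six issue labels.
print(df.columns.tolist())

# Share of each issue category across the ~6,000 annotated reviews.
print(df["issue_label"].value_counts(normalize=True).round(3))

# Mean star rating per issue category (all apps are low-rated overall).
print(df.groupby("issue_label")["rating"].mean().round(2))
```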

Results & Findings

  • Issue distribution: Performance/stability (≈ 28 %) and UI/UX (≈ 24 %) were most prevalent, followed by functionality (≈ 18 %). Security/privacy and support issues were rarer but present.
  • Review length & sentiment: Low‑rated reviews tended to be shorter and more emotionally charged (higher incidence of sarcasm and negative sentiment) compared to high‑rated counterparts reported in prior studies.
  • Cross‑app patterns: Certain issue types (e.g., crashes on specific device models) recurred across multiple apps, suggesting systemic compatibility challenges in the Amazon ecosystem.
  • Baseline classification: A simple TF‑IDF + Logistic Regression model achieved ~71 % accuracy on the 6,000 annotated reviews, confirming that the taxonomy is learnable and that the dataset can serve as a benchmark for more sophisticated deep‑learning approaches.
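
As a rough illustration of that baseline, the sketch below builds a TF‑IDF + Logistic Regression pipeline with scikit-learn. The column names and the 80/20 split are assumptions; the paper’s exact preprocessing and evaluation protocol may differ, so this is not a faithful reproduction of the reported ~71 %.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Assumed schema: 'review_text' and 'issue_label' columns in the annotated CSV.
df = pd.read_csv("amazon_appstore_low_rated_annotated.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["issue_label"],
    test_size=0.2, stratify=df["issue_label"], random_state=42,
)

# Word uni/bigram TF-IDF features feeding a linear classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", round(accuracy_score(y_test, pred), 3))
print(classification_report(y_test, pred))
```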

Practical Implications

  • Automated triage pipelines – Developers can integrate a trained classifier into CI/CD or release‑monitoring tools to flag incoming negative reviews and route them to the appropriate engineering team (UI, backend, security, etc.); a routing sketch follows this list.
  • Prioritization of bug fixes – By quantifying the share of performance vs. UI complaints, product managers can allocate resources where they’ll have the biggest impact on rating recovery.
  • Competitive intelligence – Vendors can benchmark their own low‑rated apps against the dataset to identify common failure modes, informing cross‑app remediation strategies.
  • Enhanced app‑store moderation – Store operators (Amazon, Google Play) could use the dataset to train moderation bots that detect abusive language, sarcasm, or privacy‑related allegations, improving user trust.
  • Research acceleration – The open dataset lowers the entry barrier for work on sentiment analysis, sarcasm detection, and software evolution studies that specifically target the “negative feedback” niche.
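
As a sketch of the triage idea above: given any fitted text classifier over the six categories (for example, the pipeline from the previous sketch), routing reduces to a label‑to‑team lookup. The label strings and team names below are purely illustrative assumptions, not taken from the paper.

```python
from typing import Dict

# Hypothetical label-to-team mapping; both the label strings and the team
# names are illustrative assumptions, not defined by the paper.
ISSUE_TO_TEAM: Dict[str, str] = {
    "ui_ux": "design",
    "functionality": "feature-team",
    "compatibility": "platform",
    "performance_stability": "performance",
    "support": "customer-support",
    "security_privacy": "security",
}

def route_review(review_text: str, classifier) -> str:
    """Predict an issue category for an incoming review and return the team
    that should own it. `classifier` is any fitted model exposing a
    scikit-learn-style predict() over raw text."""
    label = classifier.predict([review_text])[0]
    return ISSUE_TO_TEAM.get(label, "triage-backlog")

# Example (assuming `clf` is the pipeline trained in the previous sketch):
# route_review("Crashes every time I open it on my Fire tablet", clf)
```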

Limitations & Future Work

  • Platform scope – Limited to the Amazon Appstore; Android apps from Google Play may exhibit different review patterns.
  • Temporal bias – Reviews were collected at a single point in time; app updates could shift issue distributions, so longitudinal studies are needed.
  • Annotation granularity – Six coarse categories capture major themes but may miss finer‑grained nuances (e.g., network latency vs. battery drain). Future work could expand the taxonomy or adopt hierarchical labeling.
  • Model baselines – Only simple classifiers were evaluated; exploring transformer‑based models, multimodal inputs (ratings, timestamps), and transfer learning could boost classification performance.

Authors

  • Nek Dil Khan
  • Javed Ali Khan
  • Darvesh Khan
  • Jianqiang Li
  • Mumrez Khan
  • Shah Fahad Khan

Paper Information

  • arXiv ID: 2601.03009v1
  • Categories: cs.SE
  • Published: January 6, 2026