[Paper] Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification

Published: February 24, 2026 at 11:45 AM EST

Source: arXiv - 2602.21082v1

Overview

The paper presents a pragmatic two‑stage pipeline that combines the language understanding of large language models (LLMs) with the speed and low cost of classic text‑classification algorithms. The authors first use an LLM (ChatGPT) to discover aspects (e.g., “service”, “food quality”) in a small, human‑validated sample of restaurant reviews, then train lightweight sentiment classifiers that run over 4.7 M reviews spanning 17 years. The result is a cost‑effective approach to aspect‑based sentiment analysis (ABSA) that works at web scale.

Key Contributions

Hybrid ABSA framework: LLM‑driven aspect extraction + traditional machine‑learning sentiment classifiers for massive datasets.
Scalable pipeline: Demonstrated end‑to‑end processing of millions of reviews on commodity hardware (no GPU‑heavy inference on the full corpus).
Empirical validation: Regression analysis shows that machine‑labeled aspect sentiments explain a large share of variance in overall star ratings across cuisines, regions, and time.
Open‑source reproducibility: The authors release code, aspect dictionaries, and the labeled 4.7 M review dataset for the community.
Cross‑domain blueprint: The approach is positioned as a template for any service‑oriented sector (e.g., hotels, e‑commerce, SaaS support tickets).

Methodology

Data Sampling & Human Annotation
- Randomly sampled ~2 k restaurant reviews from a major platform.
- Human annotators labeled each review with the aspect it discussed and the corresponding sentiment (positive/negative/neutral).
Aspect Identification with an LLM
- Prompted ChatGPT to generate a concise list of recurring aspects from the annotated sample.
- The LLM also produced a set of regex‑style “aspect cues” (e.g., “waiter”, “service”, “ambiance”) that were later used for rule‑based tagging.
Training Lightweight Sentiment Classifiers
- For each discovered aspect, a binary (positive/negative) classifier was trained using classic algorithms (Logistic Regression, Linear SVM) on TF‑IDF vectors of the human‑labeled subset.
- Hyper‑parameters were tuned via cross‑validation; models were saved for batch inference.
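The training step can be illustrated with a minimal, dependency‑free sketch of TF‑IDF features feeding a logistic‑regression classifier for one aspect. Everything here is an assumption made for illustration: the toy "service" snippets, the whitespace tokenizer, the smoothed IDF formula, and the learning rate are not taken from the paper, and a real implementation would more likely use an established machine‑learning library and cross‑validated hyper‑parameters.

```python
import math
from collections import Counter

# Toy, made-up "service" snippets (1 = positive, 0 = negative); not the paper's data.
TRAIN = [
    ("friendly waiter and fast service", 1),
    ("attentive staff great service", 1),
    ("waiter was friendly and helpful", 1),
    ("rude waiter and slow service", 0),
    ("slow service staff ignored us", 0),
    ("waiter was rude and careless", 0),
]

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def predict_proba(vec, weights, bias):
    """Probability of the positive class under the logistic model."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(samples, idf, epochs=300, lr=0.5):
    """Stochastic gradient descent on the logistic loss (no regularisation)."""
    weights, bias = {}, 0.0
    vectors = [(tfidf(text, idf), label) for text, label in samples]
    for _ in range(epochs):
        for vec, label in vectors:
            err = predict_proba(vec, weights, bias) - label
            bias -= lr * err
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
    return weights, bias

idf = fit_idf([text for text, _ in TRAIN])
weights, bias = train_logreg(TRAIN, idf)

def service_sentiment(text):
    """Label a service-related snippet as 'positive' or 'negative'."""
    p = predict_proba(tfidf(text, idf), weights, bias)
    return "positive" if p >= 0.5 else "negative"
```

One such model per aspect keeps inference cheap: scoring a snippet is a sparse dot product, which is what makes CPU‑only batch processing plausible.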
Mass‑Scale Inference
- The aspect cue dictionary was applied to the full 4.7 M review corpus to assign aspect tags.
- Corresponding sentiment classifiers were run on each tagged segment, producing aspect‑level sentiment scores at negligible computational cost.
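A rough sketch of the rule‑based tagging step, under stated assumptions: the cue dictionary below is hypothetical (only “waiter”, “service”, and “ambiance” appear in the summary above), and splitting reviews into sentences is one possible reading of “tagged segment”, not necessarily the authors' segmentation.

```python
import re

# Hypothetical cue dictionary; the paper's actual cue lists are not reproduced here.
ASPECT_CUES = {
    "service": ["waiter", "waitress", "service", "staff"],
    "ambiance": ["ambiance", "atmosphere", "decor"],
    "food quality": ["food", "dish", "flavor", "taste"],
}

# One compiled, case-insensitive, whole-word pattern per aspect.
ASPECT_PATTERNS = {
    aspect: re.compile(
        r"\b(?:" + "|".join(re.escape(cue) for cue in cues) + r")\b",
        re.IGNORECASE,
    )
    for aspect, cues in ASPECT_CUES.items()
}

# Assumption: a segment is a sentence ending in '.', '!' or '?'.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def tag_review(review):
    """Map each matched aspect to the list of sentences that mention it."""
    tagged = {}
    for sentence in SENTENCE_SPLIT.split(review.strip()):
        for aspect, pattern in ASPECT_PATTERNS.items():
            if pattern.search(sentence):
                tagged.setdefault(aspect, []).append(sentence)
    return tagged
```

Each tagged sentence would then be passed to the classifier for its aspect. Whole‑word matching keeps the rules precise but misses inflected forms such as “waiters”, so a production dictionary would need plural and spelling variants.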
Statistical Analysis
- Multi‑level linear regression linked aspect sentiment scores to the overall star rating, controlling for cuisine type, city, and year.

Results & Findings

Metric	Value
Aspect coverage	92 % of reviews contained at least one of the 12 discovered aspects.
Sentiment classifier accuracy	84 % (macro‑averaged) on a held‑out human‑labeled test set.
Regression R²	0.71 overall; per‑aspect values are lower (e.g., “food quality” 0.48, “service” 0.33).
Cross‑region consistency	Aspect‑sentiment coefficients remained stable across 5 major US cities (Δ < 0.05).
Computation cost	Full pipeline processed 4.7 M reviews in ~12 h on a 16‑core CPU node (≈ $0.35 / M reviews).
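The computation‑cost row can be turned into throughput and total‑cost figures with simple arithmetic. The sketch below uses only the approximate numbers reported in the table (~12 h, ≈ $0.35 per million reviews), so the results are rough estimates rather than measured values.

```python
# Figures taken from the table above; both the runtime and unit cost are approximate.
REVIEWS = 4_700_000
HOURS = 12
COST_PER_MILLION_USD = 0.35

# Average throughput over the whole run, in reviews per second.
reviews_per_second = REVIEWS / (HOURS * 3600)

# Total cost implied by the per-million price.
total_cost_usd = COST_PER_MILLION_USD * REVIEWS / 1_000_000

print(f"throughput: about {reviews_per_second:.0f} reviews/s")
print(f"total cost: about ${total_cost_usd:.2f}")
```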

Interpretation:

The LLM‑derived aspect list captures the majority of meaningful discussion points in restaurant reviews.
Traditional classifiers, once trained on a modest human‑labeled set, can reliably predict sentiment for each aspect at scale.
Aspect‑level sentiment explains most of the variance in overall star ratings (R² = 0.71), which is consistent with the view that customers evaluate restaurants along a handful of concrete dimensions.
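For readers less familiar with R², the sketch below shows how “share of variance explained” is computed, using R² = 1 − SS_res / SS_tot. It is deliberately simplified: a single‑predictor ordinary least‑squares fit on made‑up numbers, whereas the paper reports a multi‑level regression with controls for cuisine, city, and year. The function names and data are illustrative assumptions.

```python
def ols_fit(x, y):
    """Slope and intercept of a one-predictor least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y, slope, intercept):
    """1 - SS_res / SS_tot: share of the variance in y explained by the fit."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Made-up toy data: an aspect sentiment score per review and its star rating.
aspect_score = [-1.0, -0.5, 0.0, 0.5, 1.0]
stars = [1, 3, 3, 4, 5]

slope, intercept = ols_fit(aspect_score, stars)
r2 = r_squared(aspect_score, stars, slope, intercept)
```

An R² of 0.71 therefore means the residual variance left after the regression is about 29 % of the total variance in star ratings.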

Practical Implications

Product managers & data engineers can adopt the pipeline to enrich review dashboards with granular sentiment signals (e.g., “service: ‑0.8”) without incurring massive LLM inference bills.
Recommendation engines can weight items not just by overall rating but by aspect strengths (e.g., “great food, mediocre service”), enabling more nuanced personalization.
Operational teams (restaurant chains, hotel groups) can monitor aspect trends over time to pinpoint operational bottlenecks (e.g., a dip in “wait time” sentiment).
SaaS support teams can replace generic ticket sentiment scores with aspect‑specific insights (e.g., “UI usability”, “response time”), driving targeted product improvements.
The open‑source community gains a reusable template: replace the LLM prompt with a domain‑specific one, retrain the lightweight classifiers, and scale to any textual feedback corpus.
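As one possible illustration of the recommendation use case, the sketch below ranks venues by a weighted average of aspect sentiments, where the weights express what a given user cares about. The scoring function, venue names, sentiment values, and weights are all hypothetical; the paper does not prescribe a ranking formula.

```python
def personalized_score(aspect_sentiment, user_weights):
    """Weighted average of aspect sentiments; missing aspects count as neutral (0)."""
    total = sum(user_weights.values())
    if total == 0:
        return 0.0
    weighted = sum(w * aspect_sentiment.get(a, 0.0) for a, w in user_weights.items())
    return weighted / total

def rank(venues, user_weights):
    """Venue names ordered from best to worst for this user."""
    return sorted(
        venues,
        key=lambda name: personalized_score(venues[name], user_weights),
        reverse=True,
    )

# Hypothetical aspect sentiments in [-1, 1] for two made-up venues.
VENUES = {
    "Bistro A": {"food quality": 0.9, "service": -0.2},
    "Diner B": {"food quality": 0.4, "service": 0.7},
}

food_lover = {"food quality": 0.8, "service": 0.2}
service_first = {"food quality": 0.2, "service": 0.8}
```

With these toy numbers, `rank(VENUES, food_lover)` puts "Bistro A" first, while `rank(VENUES, service_first)` puts "Diner B" first, which is the kind of “great food, mediocre service” distinction a single star rating hides.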

Limitations & Future Work

Aspect discovery relies on a single LLM (ChatGPT) and a small sample; rare or emerging aspects may be missed.
Binary sentiment labels ignore intensity (e.g., “very good” vs. “good”); modeling intensity could improve rating prediction.
Domain transfer: While the framework is generic, the cue dictionary and classifier performance may degrade when moving to non‑restaurant domains without re‑annotation.
Temporal drift: Language usage evolves; periodic re‑training of classifiers and updating of aspect cues will be needed.
Future directions suggested by the authors include: (1) incorporating few‑shot prompting to capture niche aspects, (2) exploring multi‑class or regression‑style sentiment outputs, and (3) extending the pipeline to multimodal data (photos, star emojis).

Authors

Vishal Patil
Shree Vaishnavi Bacha
Revanth Yamani
Yidan Sun
Mayank Kejriwal

Paper Information

arXiv ID: 2602.21082v1
Categories: cs.CL
Published: February 24, 2026