[Paper] Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification

Published: February 24, 2026 at 11:45 AM EST
Source: arXiv - 2602.21082v1

Overview

The paper presents a pragmatic, two‑stage pipeline that combines the language understanding of large language models (LLMs) with the speed and low cost of classic text‑classification algorithms. The authors first use an LLM (ChatGPT) to discover aspects (e.g., “service”, “food quality”) in a small, human‑validated sample of restaurant reviews. They then train lightweight sentiment classifiers and run them over 4.7 M reviews spanning 17 years. The result is a scalable, cost‑effective approach to aspect‑based sentiment analysis (ABSA) at web scale.

Key Contributions

  • Hybrid ABSA framework: LLM‑driven aspect extraction + traditional machine‑learning sentiment classifiers for massive datasets.
  • Scalable pipeline: Demonstrated end‑to‑end processing of millions of reviews on commodity hardware (no GPU‑heavy inference on the full corpus).
  • Empirical validation: Regression analysis shows that machine‑labeled aspect sentiments explain a large share of variance in overall star ratings (overall R² = 0.71) across cuisines, regions, and time.
  • Open‑source reproducibility: The authors release code, aspect dictionaries, and the labeled 4.7 M review dataset for the community.
  • Cross‑domain blueprint: The approach is positioned as a template for any service‑oriented sector (e.g., hotels, e‑commerce, SaaS support tickets).

Methodology

  1. Data Sampling & Human Annotation

    • Randomly sampled ~2 k restaurant reviews from a major platform.
    • Human annotators labeled each review with the aspect it discussed and the corresponding sentiment (positive/negative/neutral).
  2. Aspect Identification with an LLM

    • Prompted ChatGPT to generate a concise list of recurring aspects from the annotated sample.
    • The LLM also produced a set of regex‑style “aspect cues” (e.g., “waiter”, “service”, “ambiance”) that were later used for rule‑based tagging.
  3. Training Lightweight Sentiment Classifiers

    • For each discovered aspect, a binary (positive/negative) classifier was trained using classic algorithms (Logistic Regression, Linear SVM) on TF‑IDF vectors of the human‑labeled subset.
    • Hyper‑parameters were tuned via cross‑validation; models were saved for batch inference.
  4. Mass‑Scale Inference

    • The aspect cue dictionary was applied to the full 4.7 M review corpus to assign aspect tags.
    • Corresponding sentiment classifiers were run on each tagged segment, producing aspect‑level sentiment scores at negligible computational cost.
  5. Statistical Analysis

    • Multi‑level linear regression linked aspect sentiment scores to the overall star rating, controlling for cuisine type, city, and year.
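
Steps 2 to 4 above can be sketched in a few dozen lines. This is a minimal, dependency‑free illustration and not the authors' code: the cue patterns, the toy training sentences, and the names `ASPECT_CUES`, `tag_aspects`, `TinyTfidfLogReg`, and `score_review` are all assumptions made for the example. A tiny hand‑rolled logistic regression on TF‑IDF features stands in for the library models the paper mentions (Logistic Regression, Linear SVM); in practice a library such as scikit‑learn would be the usual choice.

```python
import math
import re
from collections import Counter

# Hypothetical cue dictionary; the paper's actual cues are not reproduced here.
ASPECT_CUES = {
    "service": re.compile(r"\b(waiter|waitress|server|service|staff)\b", re.I),
    "food quality": re.compile(r"\b(food|dish|meal|taste|flavor)\b", re.I),
    "ambiance": re.compile(r"\b(ambiance|atmosphere|decor|music)\b", re.I),
}


def tag_aspects(review):
    """Split a review into sentences and pair each with every aspect whose cue matches."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]
    return [(aspect, s) for s in sentences
            for aspect, cue in ASPECT_CUES.items() if cue.search(s)]


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class TinyTfidfLogReg:
    """Binary logistic regression on L2-normalized TF-IDF features, trained with SGD."""

    def fit(self, texts, labels, epochs=200, lr=0.5):
        docs = [Counter(tokenize(t)) for t in texts]
        doc_freq = Counter(tok for d in docs for tok in d)
        n = len(docs)
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
        self.w = dict.fromkeys(self.idf, 0.0)
        self.b = 0.0
        feats = [self._vectorize(d) for d in docs]
        for _ in range(epochs):
            for x, y in zip(feats, labels):
                err = self._prob(x) - y  # gradient of the log loss w.r.t. the logit
                for tok, v in x.items():
                    self.w[tok] -= lr * err * v
                self.b -= lr * err
        return self

    def _vectorize(self, counts):
        # Tokens unseen during training are ignored.
        x = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(v * v for v in x.values())) or 1.0
        return {t: v / norm for t, v in x.items()}

    def _prob(self, x):
        z = self.b + sum(self.w[t] * v for t, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, text):
        """Probability that `text` is positive (label 1)."""
        return self._prob(self._vectorize(Counter(tokenize(text))))


# Toy stand-in for the human-labeled subset (1 = positive, 0 = negative).
service_train = [
    ("The waiter was friendly and attentive.", 1),
    ("Great service, very fast.", 1),
    ("The staff was rude and slow.", 0),
    ("Terrible service, we waited forever.", 0),
]
models = {"service": TinyTfidfLogReg().fit(*zip(*service_train))}


def score_review(review):
    """Run the matching per-aspect classifier on each tagged sentence."""
    return [(aspect, sentence, models[aspect].predict_proba(sentence))
            for aspect, sentence in tag_aspects(review) if aspect in models]


for aspect, sentence, p in score_review("Our waiter was friendly. The food was bland."):
    print(f"{aspect}: P(positive)={p:.2f} | {sentence}")
```

Only a "service" model is trained here, so the food sentence is tagged but not scored. The point of the design is that the expensive model is used once, to define aspects and cues, while the per‑review work is regex matching plus a sparse dot product.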

Results & Findings

| Metric | Value |
| --- | --- |
| Aspect coverage | 92 % of reviews contained at least one of the 12 discovered aspects. |
| Sentiment classifier accuracy | 84 % (macro‑averaged) on a held‑out human‑labeled test set. |
| Regression R² | 0.71 overall; individual aspects (e.g., “food quality” R² = 0.48, “service” R² = 0.33). |
| Cross‑region consistency | Aspect‑sentiment coefficients remained stable across 5 major US cities (Δ < 0.05). |
| Computation cost | Full pipeline processed 4.7 M reviews in ~12 h on a 16‑core CPU node (≈ $0.35 / M reviews). |
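
As a back‑of‑the‑envelope check on the computation‑cost figures, the arithmetic below derives throughput and total spend from the reported numbers. The inputs are the rounded values quoted above (“~12 h”, “≈ $0.35”), so the outputs are rough. The implied hourly node price is an inference from those two figures and is not stated in the source.

```python
# Reported (rounded) figures from the results summary.
REVIEWS = 4_700_000
HOURS = 12
CORES = 16
COST_PER_MILLION_USD = 0.35

reviews_per_second = REVIEWS / (HOURS * 3600)
reviews_per_second_per_core = reviews_per_second / CORES
total_cost_usd = COST_PER_MILLION_USD * REVIEWS / 1_000_000
# Implied by the two rounded figures above; not a number given in the source.
implied_node_usd_per_hour = total_cost_usd / HOURS

print(f"throughput: {reviews_per_second:.1f} reviews/s")                 # 108.8
print(f"per core:   {reviews_per_second_per_core:.1f} reviews/s")        # 6.8
print(f"total cost: ${total_cost_usd:.3f}")                              # $1.645
print(f"implied node price: ${implied_node_usd_per_hour:.3f} per hour")  # $0.137
```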

Interpretation:

  • The LLM‑derived aspect list captures the majority of meaningful discussion points in restaurant reviews.
  • Traditional classifiers, once trained on a modest human‑labeled set, can reliably predict sentiment for each aspect at scale.
  • Aspect‑level sentiment explains most of the variance in overall star ratings, confirming that customers evaluate restaurants through a handful of concrete dimensions.
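
To make “explains the variance” concrete, here is a minimal sketch of how R² is computed for a single aspect. It is a deliberate simplification: the paper fits a multi‑level regression with controls for cuisine, city, and year, whereas this example uses plain one‑predictor least squares. The function name `ols_r2` and the five data points are made up for illustration.

```python
def ols_r2(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    # R^2 is the share of the variance in y accounted for by the fitted line.
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical per-review data: aspect sentiment in [-1, 1] and the star rating.
food_sentiment = [-1.0, -0.5, 0.0, 0.5, 1.0]
stars = [1, 3, 2, 4, 5]

slope, intercept, r2 = ols_r2(food_sentiment, stars)
print(round(slope, 2), round(intercept, 2), round(r2, 2))  # 1.8 3.0 0.81
```

An R² of 0.81 on this toy data means the fitted line accounts for 81 % of the spread in star ratings. The paper's reported 0.71 is read the same way, but for a model with several aspects and controls.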

Practical Implications

  • Product managers & data engineers can adopt the pipeline to enrich review dashboards with granular sentiment signals (e.g., “service: -0.8”) without incurring massive LLM inference bills.
  • Recommendation engines can weight items not just by overall rating but by aspect strengths (e.g., “great food, mediocre service”), enabling more nuanced personalization.
  • Operational teams (restaurant chains, hotel groups) can monitor aspect trends over time to pinpoint operational bottlenecks (e.g., a dip in “wait time” sentiment).
  • SaaS support can replace generic ticket sentiment scores with aspect‑specific insights (e.g., “UI usability”, “response time”), driving targeted product improvements.
  • Open‑source community gains a reusable template: replace the LLM prompt with a domain‑specific one, retrain the lightweight classifiers, and scale to any textual feedback corpus.
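
The trend‑monitoring use case above can be sketched as a small aggregation over aspect‑level scores. Everything here is assumed for illustration: the record layout `(year, aspect, score)`, the score range of −1 to 1, the 0.3 alert threshold, and the function names `yearly_aspect_means` and `flag_drops`.

```python
from collections import defaultdict


def yearly_aspect_means(records):
    """Average sentiment per (aspect, year) from (year, aspect, score) records."""
    sums = defaultdict(lambda: [0.0, 0])
    for year, aspect, score in records:
        acc = sums[(aspect, year)]
        acc[0] += score
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def flag_drops(means, threshold=0.3):
    """List (aspect, year, drop) where the mean fell by at least `threshold` vs. the prior year."""
    flagged = []
    for (aspect, year), mean in sorted(means.items()):
        previous = means.get((aspect, year - 1))
        if previous is not None and previous - mean >= threshold:
            flagged.append((aspect, year, round(previous - mean, 2)))
    return flagged


# Hypothetical aspect-level scores in [-1, 1].
records = [
    (2022, "wait time", 0.6), (2022, "wait time", 0.4),
    (2023, "wait time", -0.2), (2023, "wait time", 0.0),
    (2022, "food quality", 0.8), (2023, "food quality", 0.7),
]

print(flag_drops(yearly_aspect_means(records)))  # [('wait time', 2023, 0.6)]
```

Here “wait time” falls from a mean of 0.5 to −0.1 and is flagged, while the small dip in “food quality” stays under the threshold.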

Limitations & Future Work

  • Aspect discovery relies on a single LLM (ChatGPT) and a small sample; rare or emerging aspects may be missed.
  • Binary sentiment labels ignore intensity (e.g., “very good” vs. “good”), which could improve rating prediction.
  • Domain transfer: While the framework is generic, the cue dictionary and classifier performance may degrade when moving to non‑restaurant domains without re‑annotation.
  • Temporal drift: Language usage evolves; periodic re‑training of classifiers and updating of aspect cues will be needed.
  • Future directions suggested by the authors include: (1) incorporating few‑shot prompting to capture niche aspects, (2) exploring multi‑class or regression‑style sentiment outputs, and (3) extending the pipeline to multimodal data (photos, star emojis).

Authors

  • Vishal Patil
  • Shree Vaishnavi Bacha
  • Revanth Yamani
  • Yidan Sun
  • Mayank Kejriwal

Paper Information

  • arXiv ID: 2602.21082v1
  • Categories: cs.CL
  • Published: February 24, 2026