[Paper] Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

Published: April 27, 2026 at 01:30 PM EDT
4 min read
Source: arXiv - 2604.24720v1

Overview

The paper tackles a real‑world pain point for anyone building sentiment‑aware features on Indonesian e‑commerce platforms: reviews are riddled with slang, regional loanwords, numeric shortcuts, and emojis that break traditional lexicon‑based sentiment tools. By combining a classic TF‑IDF + AutoML pipeline with a modern multi‑task BiLSTM model, the authors deliver a robust solution that can simultaneously predict binary sentiment and a five‑class emotion label on a curated 5.4 k‑review dataset.

Key Contributions

  • Dual‑track classification pipeline – a lightweight TF‑IDF + AutoML baseline and a deep learning multi‑task BiLSTM that share an encoder for sentiment + emotion.
  • Comprehensive preprocessing suite – 14 sequential cleaning steps, including a custom 140‑entry slang dictionary built from marketplace corpora.
  • Extensive benchmarking – four model configurations (BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, TextCNN) evaluated against the AutoML track.
  • Open‑source and ready‑to‑use – full code, trained models, and interactive Gradio demos hosted on Hugging Face Spaces.
  • Practical training tricks – class‑weighted cross‑entropy, ReduceLROnPlateau scheduler, and early stopping to handle class imbalance and prevent over‑fitting.

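The slang-normalization step from the preprocessing suite can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the dictionary entries below are common Indonesian marketplace slang chosen for the example, whereas the authors' dictionary has roughly 140 entries mined from real corpora, and only three of the 14 cleaning steps are shown.

```python
import re

# Illustrative entries only -- the paper's dictionary has ~140 slang
# mappings mined from marketplace corpora.
SLANG_MAP = {
    "gk": "tidak",             # "not"
    "bgt": "banget",           # "very"
    "mantul": "mantap betul",  # "really great"
    "tq": "terima kasih",      # "thank you"
}

def normalize_review(text: str) -> str:
    """Apply a few of the cleaning steps: lower-casing, URL removal,
    punctuation stripping, and token-level slang normalization."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation (simplified)
    tokens = [SLANG_MAP.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

For example, `normalize_review("Barangnya mantul bgt! https://x.co/ab")` yields `"barangnya mantap betul banget"`.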
Methodology

  1. Data – The PRDECT‑ID dataset contains 5,400 Indonesian product reviews, each annotated for (i) binary sentiment (Positive/Negative) and (ii) one of five emotions (Happy, Sad, Fear, Love, Anger).
  2. Preprocessing – Reviews undergo 14 cleaning operations: lower‑casing, URL/HTML removal, emoji conversion, numeric shorthand expansion, and slang normalization using the 140‑entry dictionary.
  3. Track 1 (AutoML) – TF‑IDF vectors feed into PyCaret’s automated model search, which evaluates a suite of classical classifiers (Logistic Regression, Random Forest, XGBoost, etc.) and selects the best based on cross‑validation scores.
  4. Track 2 (Multi‑task BiLSTM) – A PyTorch BiLSTM encoder processes tokenized text. The shared hidden representation is fed into two separate fully‑connected heads: one for sentiment (binary) and one for emotion (5‑way). Variants differ in hidden size, number of layers, and dropout.
  5. Training tricks – Losses are weighted by inverse class frequency, the learning rate is reduced on plateau, and early stopping halts training when validation loss stops improving.
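The Track 2 architecture, a shared BiLSTM encoder feeding two task-specific heads, can be sketched in PyTorch. The hyperparameters (embedding size, hidden size, dropout) are illustrative defaults, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class MultiTaskBiLSTM(nn.Module):
    """Shared BiLSTM encoder with two fully-connected task heads,
    as described for Track 2. Sizes here are illustrative."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256, num_emotions: int = 5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)
        # Two separate heads over the shared representation
        self.sentiment_head = nn.Linear(2 * hidden_dim, 2)           # Positive/Negative
        self.emotion_head = nn.Linear(2 * hidden_dim, num_emotions)  # 5-way emotion

    def forward(self, token_ids: torch.Tensor):
        emb = self.embedding(token_ids)              # (batch, seq, embed)
        _, (h_n, _) = self.encoder(emb)              # h_n: (2, batch, hidden)
        shared = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat both directions
        shared = self.dropout(shared)
        return self.sentiment_head(shared), self.emotion_head(shared)
```

Because both heads backpropagate into the same encoder, gradients from the emotion task can sharpen the representation used for sentiment, and vice versa, which is the motivation for the multi-task design.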

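The training tricks in step 5 wire together in a few lines of PyTorch. This is a schematic loop on dummy data (the model, data, and thresholds are placeholders, and the "validation" loss here simply reuses the training loss for brevity):

```python
import torch
import torch.nn as nn
from collections import Counter

def inverse_frequency_weights(labels, num_classes: int) -> torch.Tensor:
    """Weight each class by inverse frequency so rare emotions
    contribute as much to the loss as frequent ones."""
    counts = Counter(labels)
    total = len(labels)
    return torch.tensor([total / (num_classes * counts.get(c, 1))
                         for c in range(num_classes)], dtype=torch.float)

# Placeholder model and dummy data standing in for the BiLSTM and reviews.
model = nn.Linear(8, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(
    weight=inverse_frequency_weights([0, 0, 0, 1, 2, 2, 3, 4], 5))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

X, y = torch.randn(32, 8), torch.randint(0, 5, (32,))
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(30):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

    val_loss = loss.item()    # stand-in for a real validation pass
    scheduler.step(val_loss)  # halve the LR when validation plateaus
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break             # early stopping
```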
Results & Findings

| Model | Sentiment Acc. | Emotion F1 (macro) |
|---|---|---|
| TF-IDF + AutoML | 84.2 % | 62.7 % |
| BiLSTM Baseline | 83.5 % | 66.1 % |
| BiLSTM Improved | 84.0 % | 65.8 % |
| BiLSTM Large | 84.3 % | 66.0 % |
| TextCNN | 82.9 % | 64.5 % |
  • The AutoML track is highly competitive on pure sentiment accuracy, essentially matching the best BiLSTM variant thanks to its ensemble of strong classical models.
  • Every multi‑task BiLSTM variant clearly outperforms the AutoML track on emotion classification, showing that a shared encoder can capture nuanced affective cues.
  • Scaling the BiLSTM (more layers/units) yields marginal gains, indicating diminishing returns beyond a certain model size for this dataset.

Practical Implications

  • Plug‑and‑play sentiment/emotion APIs – Developers can spin up the provided Gradio demo or pull the Hugging Face model to add real‑time sentiment and emotion detection to recommendation engines, review moderation tools, or chatbots targeting Indonesian users.
  • Cost‑effective baseline – The TF‑IDF + AutoML pipeline runs on CPU with minimal latency, making it suitable for edge devices or low‑budget services.
  • Improved customer insights – Emotion labels (e.g., “Fear” vs. “Anger”) enable more granular sentiment analytics, helping marketers tailor responses or prioritize support tickets.
  • Transferable preprocessing – The slang dictionary and cleaning steps can be reused for other Indonesian NLP tasks (topic modeling, intent detection) where informal language is prevalent.
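The cost-effective baseline above can be approximated with scikit-learn, which PyCaret wraps under the hood. This sketch substitutes a single logistic-regression classifier for PyCaret's automated search, and the toy reviews and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled Indonesian reviews: 1 = positive, 0 = negative.
texts = ["barang bagus banget", "pengiriman cepat mantap",
         "produk rusak kecewa", "tidak sesuai deskripsi jelek"]
labels = [1, 1, 0, 0]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram + bigram features
    LogisticRegression(max_iter=1000),    # one of the classifiers PyCaret searches
)
baseline.fit(texts, labels)
pred = baseline.predict(["barang bagus"])[0]
```

The whole pipeline trains and serves on CPU in milliseconds at this scale, which is what makes the TF-IDF track attractive for low-budget deployments.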

Limitations & Future Work

  • Dataset size – 5.4 k reviews is modest; larger, more diverse corpora could expose scalability issues and improve generalization.
  • Language coverage – The slang dictionary, while useful, captures only a fraction of the ever‑evolving marketplace vernacular; continuous updates are needed.
  • Emotion granularity – Only five emotion classes were considered; future work could explore a richer affective taxonomy or multi‑label emotion detection.
  • Cross‑lingual extension – Adapting the pipeline to other low‑resource languages with similar informal text patterns would test its robustness beyond Indonesian.

All code, models, and the interactive demo are openly available at the authors’ GitHub repository and Hugging Face Spaces, so you can start experimenting right away.

Authors

  • Hermawan Manurung
  • Ibrahim Al‑Kahfi
  • Ahmad Rizqi
  • Martin Clinton Tosima Manullang

Paper Information

  • arXiv ID: 2604.24720v1
  • Categories: cs.CL
  • Published: April 27, 2026
