[Paper] Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

Published: April 27, 2026 at 01:30 PM EDT
4 min read
Source: arXiv - 2604.24720v1

Overview

The paper tackles a real‑world pain point for anyone building sentiment‑aware features on Indonesian e‑commerce platforms: reviews are riddled with slang, regional loanwords, numeric shortcuts, and emojis that break traditional lexicon‑based sentiment tools. By combining a classic TF‑IDF + AutoML pipeline with a modern multi‑task BiLSTM model, the authors deliver a robust solution that can simultaneously predict binary sentiment and a five‑class emotion label on a curated 5.4 k‑review dataset.

Key Contributions

  • Dual‑track classification pipeline – a lightweight TF‑IDF + AutoML baseline and a deep learning multi‑task BiLSTM that share an encoder for sentiment + emotion.
  • Comprehensive preprocessing suite – 14 sequential cleaning steps, including a custom 140‑entry slang dictionary built from marketplace corpora.
  • Extensive benchmarking – four model configurations (BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, TextCNN) evaluated against the AutoML track.
  • Open‑source and ready‑to‑use – full code, trained models, and interactive Gradio demos hosted on Hugging Face Spaces.
  • Practical training tricks – class‑weighted cross‑entropy, ReduceLROnPlateau scheduler, and early stopping to handle class imbalance and prevent over‑fitting.

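The slang-normalization step from the preprocessing suite can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the dictionary entries below are common Indonesian marketplace slang chosen for the example, whereas the authors' dictionary has roughly 140 entries mined from real corpora, and only three of the 14 cleaning steps are shown.

```python
import re

# Illustrative entries only -- the paper's dictionary has ~140 slang
# mappings mined from marketplace corpora.
SLANG_MAP = {
    "gk": "tidak",             # "not"
    "bgt": "banget",           # "very"
    "mantul": "mantap betul",  # "really great"
    "tq": "terima kasih",      # "thank you"
}

def normalize_review(text: str) -> str:
    """Apply a few of the cleaning steps: lower-casing, URL removal,
    punctuation stripping, and token-level slang normalization."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation (simplified)
    tokens = [SLANG_MAP.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

For example, `normalize_review("Barangnya mantul bgt! https://x.co/ab")` yields `"barangnya mantap betul banget"`.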
Methodology

  1. Data – The PRDECT‑ID dataset contains 5,400 Indonesian product reviews, each annotated for (i) binary sentiment (Positive/Negative) and (ii) one of five emotions (Happy, Sad, Fear, Love, Anger).
  2. Preprocessing – Reviews undergo 14 cleaning operations: lower‑casing, URL/HTML removal, emoji conversion, numeric shorthand expansion, and slang normalization using the 140‑entry dictionary.
  3. Track 1 (AutoML) – TF‑IDF vectors feed into PyCaret’s automated model search, which evaluates a suite of classical classifiers (Logistic Regression, Random Forest, XGBoost, etc.) and selects the best based on cross‑validation scores.
  4. Track 2 (Multi‑task BiLSTM) – A PyTorch BiLSTM encoder processes tokenized text. The shared hidden representation is fed into two separate fully‑connected heads: one for sentiment (binary) and one for emotion (5‑way). Variants differ in hidden size, number of layers, and dropout.
  5. Training tricks – Losses are weighted by inverse class frequency, the learning rate is reduced on plateau, and early stopping halts training when validation loss stops improving.
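The Track 2 architecture, a shared BiLSTM encoder feeding two task-specific heads, can be sketched in PyTorch. The hyperparameters (embedding size, hidden size, dropout) are illustrative defaults, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class MultiTaskBiLSTM(nn.Module):
    """Shared BiLSTM encoder with two fully-connected task heads,
    as described for Track 2. Sizes here are illustrative."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256, num_emotions: int = 5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.3)
        # Two separate heads over the shared representation
        self.sentiment_head = nn.Linear(2 * hidden_dim, 2)           # Positive/Negative
        self.emotion_head = nn.Linear(2 * hidden_dim, num_emotions)  # 5-way emotion

    def forward(self, token_ids: torch.Tensor):
        emb = self.embedding(token_ids)              # (batch, seq, embed)
        _, (h_n, _) = self.encoder(emb)              # h_n: (2, batch, hidden)
        shared = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat both directions
        shared = self.dropout(shared)
        return self.sentiment_head(shared), self.emotion_head(shared)
```

Because both heads backpropagate into the same encoder, gradients from the emotion task can sharpen the representation used for sentiment, and vice versa, which is the motivation for the multi-task design.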

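The training tricks in step 5 wire together in a few lines of PyTorch. This is a schematic loop on dummy data (the model, data, and thresholds are placeholders, and the "validation" loss here simply reuses the training loss for brevity):

```python
import torch
import torch.nn as nn
from collections import Counter

def inverse_frequency_weights(labels, num_classes: int) -> torch.Tensor:
    """Weight each class by inverse frequency so rare emotions
    contribute as much to the loss as frequent ones."""
    counts = Counter(labels)
    total = len(labels)
    return torch.tensor([total / (num_classes * counts.get(c, 1))
                         for c in range(num_classes)], dtype=torch.float)

# Placeholder model and dummy data standing in for the BiLSTM and reviews.
model = nn.Linear(8, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(
    weight=inverse_frequency_weights([0, 0, 0, 1, 2, 2, 3, 4], 5))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

X, y = torch.randn(32, 8), torch.randint(0, 5, (32,))
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(30):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

    val_loss = loss.item()    # stand-in for a real validation pass
    scheduler.step(val_loss)  # halve the LR when validation plateaus
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break             # early stopping
```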
Results & Findings

| Model | Sentiment Acc. | Emotion F1 (macro) |
|---|---|---|
| TF-IDF + AutoML | 84.2 % | 62.7 % |
| BiLSTM Baseline | 83.5 % | 66.1 % |
| BiLSTM Improved | 84.0 % | 65.8 % |
| BiLSTM Large | 84.3 % | 66.0 % |
| TextCNN | 82.9 % | 64.5 % |
  • The AutoML track is highly competitive on pure sentiment accuracy, essentially matching the best BiLSTM variant thanks to its ensemble of strong classical models.
  • Every multi‑task BiLSTM variant clearly outperforms the AutoML track on emotion classification, showing that a shared encoder can capture nuanced affective cues.
  • Scaling the BiLSTM (more layers/units) yields marginal gains, indicating diminishing returns beyond a certain model size for this dataset.

Practical Implications

  • Plug‑and‑play sentiment/emotion APIs – Developers can spin up the provided Gradio demo or pull the Hugging Face model to add real‑time sentiment and emotion detection to recommendation engines, review moderation tools, or chatbots targeting Indonesian users.
  • Cost‑effective baseline – The TF‑IDF + AutoML pipeline runs on CPU with minimal latency, making it suitable for edge devices or low‑budget services.
  • Improved customer insights – Emotion labels (e.g., “Fear” vs. “Anger”) enable more granular sentiment analytics, helping marketers tailor responses or prioritize support tickets.
  • Transferable preprocessing – The slang dictionary and cleaning steps can be reused for other Indonesian NLP tasks (topic modeling, intent detection) where informal language is prevalent.
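The cost-effective baseline above can be approximated with scikit-learn, which PyCaret wraps under the hood. This sketch substitutes a single logistic-regression classifier for PyCaret's automated search, and the toy reviews and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled Indonesian reviews: 1 = positive, 0 = negative.
texts = ["barang bagus banget", "pengiriman cepat mantap",
         "produk rusak kecewa", "tidak sesuai deskripsi jelek"]
labels = [1, 1, 0, 0]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram + bigram features
    LogisticRegression(max_iter=1000),    # one of the classifiers PyCaret searches
)
baseline.fit(texts, labels)
pred = baseline.predict(["barang bagus"])[0]
```

The whole pipeline trains and serves on CPU in milliseconds at this scale, which is what makes the TF-IDF track attractive for low-budget deployments.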

Limitations & Future Work

  • Dataset size – 5.4 k reviews is modest; larger, more diverse corpora could expose scalability issues and improve generalization.
  • Language coverage – The slang dictionary, while useful, captures only a fraction of the ever‑evolving marketplace vernacular; continuous updates are needed.
  • Emotion granularity – Only five emotion classes were considered; future work could explore a richer affective taxonomy or multi‑label emotion detection.
  • Cross‑lingual extension – Adapting the pipeline to other low‑resource languages with similar informal text patterns would test its robustness beyond Indonesian.

All code, models, and the interactive demo are openly available at the authors’ GitHub repository and Hugging Face Spaces, so you can start experimenting right away.

Authors

  • Hermawan Manurung
  • Ibrahim Al‑Kahfi
  • Ahmad Rizqi
  • Martin Clinton Tosima Manullang

Paper Information

  • arXiv ID: 2604.24720v1
  • Categories: cs.CL
  • Published: April 27, 2026
