[Paper] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

Published: December 29, 2025 at 03:04 AM EST
4 min read

Source: arXiv - 2512.23280v1

Overview

The paper tackles a new, real‑world problem that’s exploding in China’s e‑commerce ecosystem: live‑stream hosts deliberately morph (i.e., mispronounce or disguise) product names and health claims to slip past platform moderation. The authors formalize this as the Live Auditory Morph Resolution (LiveAMR) task, release a large‑scale dataset of nearly 87 K annotated audio clips, and show how turning the problem into a text‑to‑text generation task—augmented with synthetic data from large language models (LLMs)—yields a practical detection pipeline.

Key Contributions

  • LiveAMR task definition – First formalization of pronunciation‑based morph detection in health‑related e‑commerce live streams.
  • LiveAMR dataset – 86,790 audio‑text pairs collected from popular Chinese platforms (e.g., Douyin), covering a wide range of morphing tactics.
  • Task reformulation – Convert morph detection into a text‑to‑text generation problem (input: ASR transcript; output: corrected “canonical” phrase).
  • LLM‑driven data augmentation – Use GPT‑4‑style models to synthesize realistic morph examples, boosting training data without costly manual labeling.
  • Empirical validation – Demonstrate that the generation‑based approach outperforms conventional classification and sequence‑labeling baselines, and that morph resolution improves downstream moderation accuracy.

Methodology

  1. Data Collection & Annotation

    • Scraped live‑stream recordings from Douyin’s health/medical channels.
    • Applied an automatic speech recognizer (ASR) to obtain raw transcripts.
    • Human annotators labeled each utterance as morphed or clean and provided the intended “canonical” phrase for morphed cases.
  2. Task Reformulation

    • Instead of binary classification, the model receives the noisy ASR transcript and is asked to generate the corrected phrase.
    • This aligns with recent successes of encoder‑decoder LLMs on text‑to‑text tasks (e.g., translation, summarization).
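The shift from classification to generation can be sketched with a hypothetical pair. The Chinese examples below are our own illustration of a homophone morph, not drawn from the LiveAMR dataset:

```python
# Illustration of the task reformulation: instead of emitting a binary
# morphed/clean label, the model generates the corrected canonical phrase.
# Examples are hypothetical, not taken from the dataset.

# Classification framing: transcript -> {0, 1}
classification_example = {
    "input": "这个要对血糖很有帮助",  # "要" (yao) stands in for the homophone "药" (medicine)
    "label": 1,                       # morphed
}

# Text-to-text framing: noisy ASR transcript -> corrected canonical phrase
generation_example = {
    "input": "这个要对血糖很有帮助",
    "output": "这个药对血糖很有帮助",  # homophone restored
}
```

The generation target carries strictly more information than the binary label: a downstream detector can match the restored phrase against its rule base, and the label itself is recoverable by checking whether input and output differ.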
  3. Model Architecture

    • Base: a Chinese‑pretrained encoder‑decoder model (e.g., mT5‑large).
    • Fine‑tuned on the LiveAMR dataset with a standard seq2seq loss.
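A minimal fine-tuning sketch, assuming a Hugging Face mT5 checkpoint; the task prefix and hyperparameters are illustrative, since the paper does not specify its exact preprocessing:

```python
# Sketch of seq2seq fine-tuning on LiveAMR-style pairs. The prefix "还原谐音: "
# ("restore homophones:") and the learning rate are our assumptions.

def to_text2text(transcript: str, canonical: str, prefix: str = "还原谐音: ") -> dict:
    """Map one annotated clip to an input/target pair for seq2seq training."""
    return {"input_text": prefix + transcript, "target_text": canonical}

def fine_tune(pairs: list, model_name: str = "google/mt5-large"):
    """Requires `transformers` and `torch`; shown for overall shape only."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for pair in pairs:
        batch = tok(pair["input_text"], return_tensors="pt")
        labels = tok(pair["target_text"], return_tensors="pt").input_ids
        # Standard seq2seq cross-entropy loss over the target tokens.
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
    return model
```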
  4. LLM‑Based Data Augmentation

    • Prompted a powerful LLM to produce synthetic morph examples by:
      • Providing a clean phrase.
      • Asking the model to “morph” it using typical evasion patterns (e.g., homophones, tonal swaps, inserted filler sounds).
    • Added ~200 K synthetic pairs to the training mix, balancing clean vs. morphed instances.
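The augmentation prompt might look like the following toy builder; the exact wording and the evasion-pattern list are our illustration, not the paper's prompt:

```python
# Toy prompt builder mimicking the paper's LLM-based augmentation step.
# Pattern names follow the evasion tactics listed above.

EVASION_PATTERNS = ("homophone substitution", "tonal swap", "inserted filler sounds")

def build_morph_prompt(clean_phrase: str, pattern: str) -> str:
    """Compose a prompt asking an LLM to morph a clean phrase."""
    if pattern not in EVASION_PATTERNS:
        raise ValueError(f"unknown evasion pattern: {pattern}")
    return (
        "You are simulating a live-stream host evading moderation.\n"
        f"Rewrite the phrase below using {pattern}, keeping it "
        "understandable to a human listener.\n"
        f"Phrase: {clean_phrase}\n"
        "Morphed phrase:"
    )
```

Each LLM response, paired with its clean source phrase, yields one synthetic (morphed, canonical) training example.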
  5. Evaluation

    • Metrics: Exact Match (EM) of generated phrase, F1 on token‑level correction, and downstream moderation recall/precision when the generated phrase is fed to a rule‑based violator detector.
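The two generation metrics can be implemented as follows. Treating each Chinese character as a token is one common convention for token-level F1; the paper's exact tokenization may differ:

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    """EM: the generated phrase equals the canonical phrase exactly."""
    return pred.strip() == gold.strip()

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 with per-character tokens (bag-of-tokens overlap)."""
    pred_tokens, gold_tokens = list(pred.strip()), list(gold.strip())
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```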

Results & Findings

| Model | Exact Match | Token‑F1 | Downstream Recall ↑ | Downstream Precision ↑ |
| --- | --- | --- | --- | --- |
| Baseline classifier (binary) | — | — | 68.2 % | 71.5 % |
| Seq2Seq (no augmentation) | 62.4 % | 78.1 % | 74.9 % | 77.3 % |
| Seq2Seq + LLM augmentation | 71.8 % | 84.6 % | 81.5 % | 83.2 % |

(The binary classifier generates no phrase, so Exact Match and Token‑F1 do not apply.)
  • The generation‑based approach reduces false negatives (missed morphs) by >13 % compared with a pure classifier.
  • Adding synthetic morphs improves both generation quality and downstream moderation performance, confirming that LLMs can reliably mimic human morphing strategies.
  • Error analysis shows remaining challenges around extremely short utterances and heavily background‑noisy streams.

Practical Implications

  • Platform moderation pipelines can integrate the model as a pre‑processor: raw ASR → corrected phrase → existing rule‑based or ML violator detectors. This yields higher detection rates without overhauling downstream components.
  • Developer‑friendly API – The authors release a lightweight inference service (REST + gRPC) that accepts an audio clip, runs ASR, then the seq2seq morph resolver, returning the normalized text.
  • Scalable to other languages & domains – The same “text‑to‑text” reformulation can be adapted to English‑language livestreams (e.g., “pharma‑hype” on TikTok) or to other evasion tactics like visual watermark removal.
  • Cost‑effective data expansion – Using LLMs to generate adversarial examples reduces the need for large manual annotation campaigns, an approach that can be replicated for any emerging moderation problem.
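The pre-processor integration described above can be sketched end to end. The resolver is stubbed with one hard-coded substitution, and the banned-term list is a hypothetical placeholder for a platform's rule base:

```python
# Sketch of the moderation pipeline: ASR transcript -> morph resolver ->
# existing rule-based detector. In practice resolve_morph would call the
# fine-tuned seq2seq model; here it is a stub for illustration.

BANNED_CLAIMS = ["根治糖尿病"]  # hypothetical rule-base entry ("cures diabetes")

def resolve_morph(transcript: str) -> str:
    """Stub resolver: restores one example morph ("糖料病" -> "糖尿病")."""
    return transcript.replace("糖料病", "糖尿病")

def flag_violation(transcript: str) -> bool:
    """Run the unchanged rule-based detector on the *resolved* transcript."""
    canonical = resolve_morph(transcript)
    return any(term in canonical for term in BANNED_CLAIMS)
```

Because only the input text is normalized, the downstream detector and its rule base stay untouched, which is the point of the pre-processor design.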

Limitations & Future Work

  • ASR dependency – Errors in the initial speech transcription propagate to the generation stage; improving ASR for noisy live streams is still needed.
  • Domain specificity – The dataset focuses on health/medical claims; morph patterns in other product categories may differ, requiring domain‑specific fine‑tuning.
  • Synthetic realism gap – While LLM‑generated morphs are diverse, they may not capture the full nuance of human improvisation (e.g., regional accents, spontaneous filler words). Future work could involve human‑in‑the‑loop generation or adversarial training with live streamers.
  • Real‑time constraints – Current inference latency (~300 ms per 5‑second clip) is acceptable for batch moderation but may need optimization for live, sub‑second flagging.

Authors

  • Jiahao Zhu
  • Jipeng Qiang
  • Ran Bai
  • Chenyu Liu
  • Xiaoye Ouyang

Paper Information

  • arXiv ID: 2512.23280v1
  • Categories: cs.CL
  • Published: December 29, 2025