[Paper] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

Published: December 29, 2025 at 03:04 AM EST
4 min read

Source: arXiv - 2512.23280v1

Overview

The paper tackles a new, real‑world problem that’s exploding in China’s e‑commerce ecosystem: live‑stream hosts deliberately morph (i.e., mispronounce or disguise) product names and health claims to slip past platform moderation. The authors formalize this as the Live Auditory Morph Resolution (LiveAMR) task, release a large‑scale dataset of nearly 87 K annotated audio clips, and show how turning the problem into a text‑to‑text generation task—augmented with synthetic data from large language models (LLMs)—yields a practical detection pipeline.

Key Contributions

  • LiveAMR task definition – First formalization of pronunciation‑based morph detection in health‑related e‑commerce live streams.
  • LiveAMR dataset – 86,790 audio‑text pairs collected from popular Chinese platforms (e.g., Douyin), covering a wide range of morphing tactics.
  • Task reformulation – Convert morph detection into a text‑to‑text generation problem (input: ASR transcript; output: corrected “canonical” phrase).
  • LLM‑driven data augmentation – Use GPT‑4‑style models to synthesize realistic morph examples, boosting training data without costly manual labeling.
  • Empirical validation – Demonstrate that the generation‑based approach outperforms conventional classification and sequence‑labeling baselines, and that morph resolution improves downstream moderation accuracy.

Methodology

  1. Data Collection & Annotation

    • Scraped live‑stream recordings from Douyin’s health/medical channels.
    • Applied an automatic speech recognizer (ASR) to obtain raw transcripts.
    • Human annotators labeled each utterance as morphed or clean and provided the intended “canonical” phrase for morphed cases.
  2. Task Reformulation

    • Instead of binary classification, the model receives the noisy ASR transcript and is asked to generate the corrected phrase.
    • This aligns with recent successes of encoder‑decoder LLMs on text‑to‑text tasks (e.g., translation, summarization).
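The shift from classification to generation can be sketched with a hypothetical pair. The Chinese examples below are our own illustration of a homophone morph, not drawn from the LiveAMR dataset:

```python
# Illustration of the task reformulation: instead of emitting a binary
# morphed/clean label, the model generates the corrected canonical phrase.
# Examples are hypothetical, not taken from the dataset.

# Classification framing: transcript -> {0, 1}
classification_example = {
    "input": "这个要对血糖很有帮助",  # "要" (yao) stands in for the homophone "药" (medicine)
    "label": 1,                       # morphed
}

# Text-to-text framing: noisy ASR transcript -> corrected canonical phrase
generation_example = {
    "input": "这个要对血糖很有帮助",
    "output": "这个药对血糖很有帮助",  # homophone restored
}
```

The generation target carries strictly more information than the binary label: a downstream detector can match the restored phrase against its rule base, and the label itself is recoverable by checking whether input and output differ.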
  3. Model Architecture

    • Base: a Chinese‑pretrained encoder‑decoder model (e.g., mT5‑large).
    • Fine‑tuned on the LiveAMR dataset with a standard seq2seq loss.
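A minimal fine-tuning sketch, assuming a Hugging Face mT5 checkpoint; the task prefix and hyperparameters are illustrative, since the paper does not specify its exact preprocessing:

```python
# Sketch of seq2seq fine-tuning on LiveAMR-style pairs. The prefix "还原谐音: "
# ("restore homophones:") and the learning rate are our assumptions.

def to_text2text(transcript: str, canonical: str, prefix: str = "还原谐音: ") -> dict:
    """Map one annotated clip to an input/target pair for seq2seq training."""
    return {"input_text": prefix + transcript, "target_text": canonical}

def fine_tune(pairs: list, model_name: str = "google/mt5-large"):
    """Requires `transformers` and `torch`; shown for overall shape only."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    import torch

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for pair in pairs:
        batch = tok(pair["input_text"], return_tensors="pt")
        labels = tok(pair["target_text"], return_tensors="pt").input_ids
        # Standard seq2seq cross-entropy loss over the target tokens.
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
    return model
```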
  4. LLM‑Based Data Augmentation

    • Prompted a powerful LLM to produce synthetic morph examples by:
      • Providing a clean phrase.
      • Asking the model to “morph” it using typical evasion patterns (e.g., homophones, tonal swaps, inserted filler sounds).
    • Added ~200 K synthetic pairs to the training mix, balancing clean vs. morphed instances.
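The augmentation prompt might look like the following toy builder; the exact wording and the evasion-pattern list are our illustration, not the paper's prompt:

```python
# Toy prompt builder mimicking the paper's LLM-based augmentation step.
# Pattern names follow the evasion tactics listed above.

EVASION_PATTERNS = ("homophone substitution", "tonal swap", "inserted filler sounds")

def build_morph_prompt(clean_phrase: str, pattern: str) -> str:
    """Compose a prompt asking an LLM to morph a clean phrase."""
    if pattern not in EVASION_PATTERNS:
        raise ValueError(f"unknown evasion pattern: {pattern}")
    return (
        "You are simulating a live-stream host evading moderation.\n"
        f"Rewrite the phrase below using {pattern}, keeping it "
        "understandable to a human listener.\n"
        f"Phrase: {clean_phrase}\n"
        "Morphed phrase:"
    )
```

Each LLM response, paired with its clean source phrase, yields one synthetic (morphed, canonical) training example.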
  5. Evaluation

    • Metrics: Exact Match (EM) of generated phrase, F1 on token‑level correction, and downstream moderation recall/precision when the generated phrase is fed to a rule‑based violator detector.
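The two generation metrics can be implemented as follows. Treating each Chinese character as a token is one common convention for token-level F1; the paper's exact tokenization may differ:

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    """EM: the generated phrase equals the canonical phrase exactly."""
    return pred.strip() == gold.strip()

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 with per-character tokens (bag-of-tokens overlap)."""
    pred_tokens, gold_tokens = list(pred.strip()), list(gold.strip())
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```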

Results & Findings

| Model | Exact Match | Token‑F1 | Downstream Recall ↑ | Downstream Precision ↑ |
| --- | --- | --- | --- | --- |
| Baseline classifier (binary) | — | — | 68.2 % | 71.5 % |
| Seq2Seq (no augmentation) | 62.4 % | 78.1 % | 74.9 % | 77.3 % |
| Seq2Seq + LLM augmentation | 71.8 % | 84.6 % | 81.5 % | 83.2 % |

(The binary classifier generates no phrase, so Exact Match and Token‑F1 do not apply.)
  • The generation‑based approach reduces false negatives (missed morphs) by >13 % compared with a pure classifier.
  • Adding synthetic morphs improves both generation quality and downstream moderation performance, confirming that LLMs can reliably mimic human morphing strategies.
  • Error analysis shows remaining challenges around extremely short utterances and heavily background‑noisy streams.

Practical Implications

  • Platform moderation pipelines can integrate the model as a pre‑processor: raw ASR → corrected phrase → existing rule‑based or ML violator detectors. This yields higher detection rates without overhauling downstream components.
  • Developer‑friendly API – The authors release a lightweight inference service (REST + gRPC) that accepts an audio clip, runs ASR, then the seq2seq morph resolver, returning the normalized text.
  • Scalable to other languages & domains – The same “text‑to‑text” reformulation can be adapted to English‑language livestreams (e.g., “pharma‑hype” on TikTok) or to other evasion tactics like visual watermark removal.
  • Cost‑effective data expansion – Using LLMs to generate adversarial examples reduces the need for large manual annotation campaigns, an approach that can be replicated for any emerging moderation problem.
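The pre-processor integration described above can be sketched end to end. The resolver is stubbed with one hard-coded substitution, and the banned-term list is a hypothetical placeholder for a platform's rule base:

```python
# Sketch of the moderation pipeline: ASR transcript -> morph resolver ->
# existing rule-based detector. In practice resolve_morph would call the
# fine-tuned seq2seq model; here it is a stub for illustration.

BANNED_CLAIMS = ["根治糖尿病"]  # hypothetical rule-base entry ("cures diabetes")

def resolve_morph(transcript: str) -> str:
    """Stub resolver: restores one example morph ("糖料病" -> "糖尿病")."""
    return transcript.replace("糖料病", "糖尿病")

def flag_violation(transcript: str) -> bool:
    """Run the unchanged rule-based detector on the *resolved* transcript."""
    canonical = resolve_morph(transcript)
    return any(term in canonical for term in BANNED_CLAIMS)
```

Because only the input text is normalized, the downstream detector and its rule base stay untouched, which is the point of the pre-processor design.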

Limitations & Future Work

  • ASR dependency – Errors in the initial speech transcription propagate to the generation stage; improving ASR for noisy live streams is still needed.
  • Domain specificity – The dataset focuses on health/medical claims; morph patterns in other product categories may differ, requiring domain‑specific fine‑tuning.
  • Synthetic realism gap – While LLM‑generated morphs are diverse, they may not capture the full nuance of human improvisation (e.g., regional accents, spontaneous filler words). Future work could involve human‑in‑the‑loop generation or adversarial training with live streamers.
  • Real‑time constraints – Current inference latency (~300 ms per 5‑second clip) is acceptable for batch moderation but may need optimization for live, sub‑second flagging.

Authors

  • Jiahao Zhu
  • Jipeng Qiang
  • Ran Bai
  • Chenyu Liu
  • Xiaoye Ouyang

Paper Information

  • arXiv ID: 2512.23280v1
  • Categories: cs.CL
  • Published: December 29, 2025