[Paper] VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications
Source: arXiv - 2603.04145v1
Overview
The paper introduces VietNormalizer, a lightweight, zero‑dependency Python library that turns messy Vietnamese text (numbers, dates, acronyms, emojis, and foreign words) into clean, fully pronounceable sentences. By handling the non‑standard words (NSWs) that break Text‑to‑Speech (TTS) pipelines and many NLP models, the library fills a long‑standing gap for developers building Vietnamese language products.
Key Contributions
- Open‑source, pip‑installable library with no external dependencies (no heavy neural models, no GPU required).
- Comprehensive rule‑based pipeline covering seven major NSW classes: integers/decimals, dates/times, VND & USD amounts, percentages, acronyms, loanwords/foreign terms, and Unicode/emoji cleanup.
- High‑throughput design: all regex patterns are pre‑compiled at import time, enabling fast batch processing with minimal memory footprint.
- Customizable acronym dictionary (CSV) and extensible transliteration rules, allowing developers to adapt the system to domain‑specific vocabularies.
- MIT‑licensed and hosted on PyPI/GitHub, encouraging community contributions and easy integration into existing TTS/NLP stacks.
Methodology
VietNormalizer follows a purely rule‑based approach that sidesteps the need for large language models:
- Pre‑compilation – At library initialization, every regular‑expression pattern (e.g., number detection, date formats) is compiled once, avoiding runtime recompilation overhead.
- Sequential processing pipeline – Input text passes through a series of deterministic modules:
  - Unicode normalization (NFC/NFKC) and removal of emojis/special symbols.
  - Number conversion – Handles arbitrary‑length integers, floating‑point numbers, and large magnitudes (thousands, millions, billions) by mapping digits to Vietnamese words.
  - Date/Time handling – Recognizes common Vietnamese and ISO date formats, converting them to spoken forms (e.g., “12/03/2024” → “mười hai tháng ba năm hai nghìn không trăm hai mươi bốn”).
  - Currency & percentages – Detects VND/USD symbols and percentage signs, expanding them with appropriate units.
  - Acronym expansion – Looks up tokens in a user‑provided CSV dictionary (e.g., “AI” → “trí tuệ nhân tạo”).
  - Transliteration – Applies a handcrafted mapping table to approximate the phonetics of foreign loanwords (e.g., “Google” → “gu-gồ”).
- Output – Returns a normalized string ready for downstream TTS synthesis or NLP tokenization.
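The staged design described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the library's actual code: the function names, the emoji ranges, the two-column CSV layout, and the "VND" dictionary entry are all assumptions made for the example.

```python
import csv
import io
import re
import unicodedata

# Patterns are compiled once at import time, mirroring the pre-compilation idea.
# The emoji ranges are an illustrative subset, not the library's real list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
SPACE_RE = re.compile(r"\s+")
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def load_acronyms(csv_text):
    """Parse 'acronym,expansion' rows into a dict (hypothetical CSV layout)."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def normalize_unicode(text):
    """Compose characters so accented Vietnamese letters compare consistently."""
    return unicodedata.normalize("NFC", text)


def strip_emoji(text):
    """Replace emoji in the sample ranges with a space."""
    return EMOJI_RE.sub(" ", text)


def make_acronym_stage(table):
    """Build a stage that expands known acronyms and leaves unknown ones alone."""
    def expand(text):
        return ACRONYM_RE.sub(lambda m: table.get(m.group(0), m.group(0)), text)
    return expand


def collapse_spaces(text):
    return SPACE_RE.sub(" ", text).strip()


def run_pipeline(text, stages):
    """Apply each stage in order; the same input always gives the same output."""
    for stage in stages:
        text = stage(text)
    return text


# Illustrative dictionary; a real deployment would read this from a CSV file.
ACRONYMS = load_acronyms("AI,trí tuệ nhân tạo\nVND,việt nam đồng\n")
STAGES = [normalize_unicode, strip_emoji, make_acronym_stage(ACRONYMS), collapse_spaces]

print(run_pipeline("AI  rất hay 😀", STAGES))  # → trí tuệ nhân tạo rất hay
```

Because the stages are plain functions held in a list, reordering or replacing one is a one-line change, which is the property the deterministic design relies on.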
Because the pipeline is deterministic, developers can debug, extend, or reorder modules without worrying about stochastic model behavior.
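To make the number and date steps concrete, here is a hedged sketch of rule-based integer reading. It is not taken from the paper: the function names are hypothetical, it covers only non-negative integers below 10^12, it assumes day/month/year ordering, and it uses one common reading convention ("linh" for a zero tens digit, "mốt" and "lăm" after tens, "bốn" rather than "tư").

```python
import re

DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
SCALES = ["", "nghìn", "triệu", "tỷ"]  # units, thousands, millions, billions
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def read_two_digits(n):
    """Read 1..99, applying the 'mốt' (x1) and 'lăm' (x5) sound changes."""
    tens, unit = divmod(n, 10)
    if tens == 0:
        return DIGITS[unit]
    words = ["mười"] if tens == 1 else [DIGITS[tens], "mươi"]
    if unit == 1 and tens > 1:
        words.append("mốt")
    elif unit == 5:
        words.append("lăm")
    elif unit > 0:
        words.append(DIGITS[unit])
    return " ".join(words)


def read_group(n, full):
    """Read 1..999; full=True forces 'không trăm' in non-leading groups."""
    hundreds, rest = divmod(n, 100)
    words = []
    if hundreds or full:
        words += [DIGITS[hundreds], "trăm"]
        if 0 < rest < 10:
            words.append("linh")
    if rest:
        words.append(read_two_digits(rest))
    return " ".join(words)


def read_integer(n):
    """Read an integer in [0, 10**12) as Vietnamese words."""
    if not 0 <= n < 10**12:
        raise ValueError("this sketch only covers 0 <= n < 10**12")
    if n == 0:
        return DIGITS[0]
    groups = []
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    top = len(groups) - 1
    parts = []
    for i in range(top, -1, -1):
        if groups[i] == 0:
            continue  # empty groups are silent, e.g. 1,000,005
        text = read_group(groups[i], full=(i != top))
        parts.append(f"{text} {SCALES[i]}".strip())
    return " ".join(parts)


def read_date(text):
    """Expand dd/mm/yyyy dates in place (day-first order is an assumption)."""
    def expand(match):
        day, month, year = (int(g) for g in match.groups())
        return f"{read_integer(day)} tháng {read_integer(month)} năm {read_integer(year)}"
    return DATE_RE.sub(expand, text)


print(read_integer(105))        # → một trăm linh năm
print(read_date("12/03/2024"))  # → mười hai tháng ba năm hai nghìn không trăm hai mươi bốn
```

Regional variants ("lẻ" instead of "linh", "ngàn" instead of "nghìn") would be table changes rather than logic changes in a design like this.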
Results & Findings
- Speed – Benchmarks on a standard laptop (Intel i7, 16 GB RAM) show processing of ≈ 10,000 sentences per second, far outpacing neural‑based normalizers that require GPU inference.
- Memory – The library stays under 30 MB of RAM after loading, making it suitable for edge devices or serverless functions.
- Coverage – In a curated test set of 5,000 real‑world Vietnamese sentences (social media, news, and TTS scripts), VietNormalizer correctly normalized ≈ 96 % of NSW instances, outperforming existing open‑source tools that typically handle only 60–70 % of the same categories.
- Error analysis highlighted a few edge cases (e.g., ambiguous date formats like “01/02/03”) that require contextual disambiguation, which rule‑based logic alone cannot resolve.
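Throughput figures like the one above depend heavily on hardware and sentence length, so it can be worth measuring on your own corpus. The harness below is a generic sketch: `toy_normalize` is a stand-in written for this example, not VietNormalizer's API, and the repeated sample sentence is invented.

```python
import re
import time

DIGIT_RE = re.compile(r"\d+")


def toy_normalize(sentence):
    """Stand-in normalizer: replaces digit runs with a placeholder token."""
    return DIGIT_RE.sub("<num>", sentence)


def sentences_per_second(normalize, sentences, repeats=3):
    """Time several full passes and report throughput for the fastest one."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            normalize(sentence)
        best = min(best, time.perf_counter() - start)
    return len(sentences) / best


corpus = ["Giá 25000 đồng ngày 12/03/2024"] * 10_000
print(f"{sentences_per_second(toy_normalize, corpus):,.0f} sentences/s")
```

Swapping `toy_normalize` for the real normalizer call gives a like-for-like number; taking the best of several passes reduces noise from other processes.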
Practical Implications
- TTS pipelines can now ingest raw user‑generated content (comments, chat logs) without a separate preprocessing step, reducing latency and simplifying deployment.
- Voice assistants targeting Vietnamese markets can reliably read out numbers, dates, and foreign brand names, improving user experience.
- NLP tasks such as sentiment analysis, named‑entity recognition, or machine translation benefit from a cleaner token stream, leading to higher downstream accuracy.
- Serverless or mobile apps can embed the library directly (thanks to its tiny footprint), avoiding costly model downloads and GPU requirements.
- Rapid prototyping – Data scientists can plug VietNormalizer into Jupyter notebooks with a single pip install command, accelerating experimentation on Vietnamese corpora.
Limitations & Future Work
- The rule‑based system cannot resolve ambiguous contexts (e.g., “03/04/05” could be a date or a version number) without additional linguistic cues.
- Domain‑specific slang or newly coined acronyms require manual dictionary updates; the library does not learn new patterns automatically.
- Transliteration rules are hand‑crafted and may not capture all phonetic nuances of emerging loanwords.
- The authors suggest extending the framework with lightweight statistical disambiguation (e.g., a small CRF model) and exploring cross‑language transfer to other low‑resource tonal languages such as Thai or Burmese.
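The ambiguity problem is easy to reproduce with the standard library alone. The sketch below is an illustration rather than anything from the paper: it tries three plausible field orders (an assumed candidate list) and shows that a token such as "03/04/05" parses as a valid date under every one of them.

```python
from datetime import datetime

# Candidate interpretations; which ones to consider is an assumption.
FORMATS = {
    "day/month/year": "%d/%m/%y",
    "month/day/year": "%m/%d/%y",
    "year/month/day": "%y/%m/%d",
}


def candidate_dates(token):
    """Return every interpretation under which the token is a valid date."""
    results = {}
    for label, fmt in FORMATS.items():
        try:
            results[label] = datetime.strptime(token, fmt).date()
        except ValueError:
            continue  # not a valid date in this field order
    return results


for label, value in candidate_dates("03/04/05").items():
    print(label, value.isoformat())
# day/month/year 2005-04-03
# month/day/year 2005-03-04
# year/month/day 2003-04-05
```

When more than one candidate survives, a rule-based system has to pick a default order or defer to context, which is where the statistical disambiguation proposed by the authors would come in.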
VietNormalizer demonstrates that a well‑engineered, dependency‑free rule‑based approach can meet the demanding real‑time needs of modern Vietnamese TTS and NLP applications, offering developers a practical tool that bridges the gap between raw user text and high‑quality language processing.
Authors
- Hung Vu Nguyen
- Loan Do
- Thanh Ngoc Nguyen
- Ushik Shrestha Khwakhali
- Thanh Pham
- Vinh Do
- Charlotte Nguyen
- Hien Nguyen
Paper Information
- arXiv ID: 2603.04145v1
- Categories: cs.CL, cs.NE
- Published: March 4, 2026