[Paper] VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications

Published: March 4, 2026 at 09:58 AM EST

Source: arXiv - 2603.04145v1

Overview

The paper introduces VietNormalizer, a lightweight, zero‑dependency Python library that turns messy Vietnamese text (numbers, dates, acronyms, emojis, and foreign words) into clean, fully pronounceable sentences. By handling the non‑standard words (NSWs) that break Text‑to‑Speech (TTS) pipelines and many NLP models, the library fills a long‑standing gap for developers building Vietnamese language products.

Key Contributions

Open‑source, pip‑installable library with no external dependencies (no heavy neural models, no GPU required).
Comprehensive rule‑based pipeline covering seven major NSW classes: integers/decimals, dates/times, VND & USD amounts, percentages, acronyms, loanwords/foreign terms, and Unicode/emoji cleanup.
High‑throughput design: all regex patterns are pre‑compiled at import time, enabling fast batch processing with minimal memory footprint.
Customizable acronym dictionary (CSV) and extensible transliteration rules, allowing developers to adapt the system to domain‑specific vocabularies.
MIT‑licensed and hosted on PyPI/GitHub, encouraging community contributions and easy integration into existing TTS/NLP stacks.
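
As a rough illustration of the customizable acronym dictionary, the sketch below loads a CSV and applies it with a single pre-compiled pattern. The two-column layout (acronym, expansion), the sample rows, and the names `load_acronyms` and `build_expander` are assumptions made for this example; they are not taken from the library's actual API.

```python
import csv
import io
import re

# Hypothetical two-column CSV: acronym, spoken expansion.
SAMPLE_CSV = """acronym,expansion
AI,trí tuệ nhân tạo
TTS,chuyển văn bản thành giọng nói
"""


def load_acronyms(csv_text: str) -> dict[str, str]:
    """Parse CSV text into an {acronym: expansion} mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["acronym"]: row["expansion"] for row in reader}


def build_expander(acronyms: dict[str, str]):
    """Compile one alternation pattern and return a function that applies it."""
    if not acronyms:
        return lambda text: text
    # Longest keys first so a longer acronym wins over one that is its prefix.
    keys = sorted(acronyms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return lambda text: pattern.sub(lambda m: acronyms[m.group(0)], text)


expand = build_expander(load_acronyms(SAMPLE_CSV))
print(expand("AI và TTS"))
```

Keeping the dictionary in a plain CSV means a domain team could add entries without touching code, at the cost of maintaining the file by hand.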

Methodology

VietNormalizer follows a purely rule‑based approach that sidesteps the need for large language models:

Pre‑compilation – At library initialization, every regular‑expression pattern (e.g., number detection, date formats) is compiled once, avoiding runtime recompilation overhead.
Sequential processing pipeline – Input text passes through a series of deterministic modules:
- Unicode normalization (NFC/NFKC) and removal of emojis/special symbols.
- Number conversion – Handles arbitrary‑length integers, floating‑point numbers, and large magnitudes (thousands, millions, billions) by mapping digits to Vietnamese words.
- Date/Time handling – Recognizes common Vietnamese and ISO date formats, converting them to spoken forms (e.g., “12/03/2024” → “mười hai tháng ba năm hai nghìn không trăm hai mươi bốn”).
- Currency & percentages – Detects VND/USD symbols and percentage signs, expanding them with appropriate units.
- Acronym expansion – Looks up tokens in a user‑provided CSV dictionary (e.g., “AI” → “trí tuệ nhân tạo”).
- Transliteration – Applies a handcrafted mapping table to approximate the phonetics of foreign loanwords (e.g., “Google” → “gu-gồ”).
Output – Returns a normalized string ready for downstream TTS synthesis or NLP tokenization.
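
The number and date steps can be sketched as follows. This is a simplified illustration, not the library's code: it covers non-negative integers below 10^12, uses the northern forms "nghìn" and "linh", reads a trailing 4 as "bốn" rather than "tư", and reads the year of a dd/mm/yyyy date as a full cardinal number. All function names are hypothetical.

```python
import re

DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
GROUP_UNITS = ["", "nghìn", "triệu", "tỷ"]


def _read_under_100(n: int) -> str:
    """Read 0..99, applying the 'mốt' and 'lăm' sound changes."""
    if n < 10:
        return DIGITS[n]
    tens, units = divmod(n, 10)
    head = "mười" if tens == 1 else f"{DIGITS[tens]} mươi"
    if units == 0:
        return head
    if units == 1 and tens > 1:
        return f"{head} mốt"
    if units == 5:
        return f"{head} lăm"
    return f"{head} {DIGITS[units]}"


def _read_group(n: int, pad: bool) -> str:
    """Read one three-digit group; pad=True voices a zero hundreds digit."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds or pad:
        parts.append(f"{DIGITS[hundreds]} trăm")
    if rest:
        if parts and rest < 10:
            parts.append(f"linh {DIGITS[rest]}")
        else:
            parts.append(_read_under_100(rest))
    return " ".join(parts)


def read_integer(n: int) -> str:
    """Spell out an integer in the range 0 <= n < 10**12."""
    if not 0 <= n < 10**12:
        raise ValueError("this sketch only covers 0 <= n < 10**12")
    if n == 0:
        return DIGITS[0]
    groups = []
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    top = len(groups) - 1
    words = []
    for i in range(top, -1, -1):
        if groups[i] == 0:
            continue  # all-zero groups are silent
        text = _read_group(groups[i], pad=(i != top))
        words.append(f"{text} {GROUP_UNITS[i]}".strip())
    return " ".join(words)


# Compiled once; matches dd/mm/yyyy with one- or two-digit day and month.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def _read_date(match: re.Match) -> str:
    day, month, year = (int(g) for g in match.groups())
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return match.group(0)  # leave implausible dates untouched
    month_word = "tư" if month == 4 else read_integer(month)  # April is "tháng tư"
    return f"{read_integer(day)} tháng {month_word} năm {read_integer(year)}"


def normalize_dates(text: str) -> str:
    return DATE_RE.sub(_read_date, text)


print(read_integer(2024))
print(normalize_dates("12/03/2024"))
```

A production normalizer would also need decimals, regional variants, and other date layouts, which this sketch deliberately leaves out.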

Because the pipeline is deterministic, developers can debug, extend, or reorder modules without worrying about stochastic model behavior.
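
One way to picture such a pipeline is an ordered list of plain functions, each backed by a pattern compiled once at import time. The stage names, the rough emoji ranges, and the `run_pipeline` helper below are assumptions for illustration only; the number stage is omitted, so digits pass through unchanged.

```python
import re
import unicodedata
from typing import Callable

# Patterns are compiled once, when the module is imported.
PERCENT_RE = re.compile(r"(\d+)\s*%")
WHITESPACE_RE = re.compile(r"\s+")
# Rough pictograph ranges chosen for this sketch; not an exhaustive emoji list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def normalize_unicode(text: str) -> str:
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


def strip_emoji(text: str) -> str:
    return EMOJI_RE.sub(" ", text)


def expand_percent(text: str) -> str:
    return PERCENT_RE.sub(r"\1 phần trăm", text)


def collapse_whitespace(text: str) -> str:
    return WHITESPACE_RE.sub(" ", text).strip()


# The order is explicit, so stages can be inspected, swapped, or extended.
PIPELINE: list[Callable[[str], str]] = [
    normalize_unicode,
    strip_emoji,
    expand_percent,
    collapse_whitespace,
]


def run_pipeline(text: str, stages: list[Callable[[str], str]] = PIPELINE) -> str:
    for stage in stages:
        text = stage(text)
    return text


print(run_pipeline("Giảm 50% 🎉  hôm nay"))
```

Because every stage is a pure string-to-string function, the same input always yields the same output, and a failing case can be traced by running the stages one at a time.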

Results & Findings

Speed – Benchmarks on a standard laptop (Intel i7, 16 GB RAM) show processing of ≈ 10,000 sentences per second, far outpacing neural‑based normalizers that require GPU inference.
Memory – The library stays under 30 MB of RAM after loading, making it suitable for edge devices or serverless functions.
Coverage – In a curated test set of 5,000 real‑world Vietnamese sentences (social media, news, and TTS scripts), VietNormalizer correctly normalized ≈ 96 % of NSW instances, outperforming existing open‑source tools that typically handle only 60–70 % of the same categories.
Error analysis highlighted a few edge cases (e.g., ambiguous date formats like “01/02/03”) that require contextual disambiguation, which rule‑based logic alone cannot resolve.
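
Readers who want to check throughput and memory on their own hardware could start from a harness like the one below. It is only a sketch: `stand_in_normalize` is a placeholder to be swapped for a real normalizer call, and the figures it prints say nothing about the numbers reported in the paper. Note that `tracemalloc` tracks Python-level allocations rather than total process memory, and tracing slows execution, so a careful benchmark would time and trace in separate runs.

```python
import re
import time
import tracemalloc

PERCENT_RE = re.compile(r"(\d+)\s*%")


def stand_in_normalize(text: str) -> str:
    """Placeholder for a real normalizer call (assumption for this sketch)."""
    return PERCENT_RE.sub(r"\1 phần trăm", text)


def measure(normalize, sentences):
    """Return (sentences per second, peak traced bytes) for one pass."""
    tracemalloc.start()
    start = time.perf_counter()
    for sentence in sentences:
        normalize(sentence)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(sentences) / elapsed, peak


sentences = ["Giá tăng 5% trong tháng này."] * 10_000
rate, peak_bytes = measure(stand_in_normalize, sentences)
print(f"{rate:,.0f} sentences/s, peak traced allocations {peak_bytes / 1e6:.2f} MB")
```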

Practical Implications

TTS pipelines can now ingest raw user‑generated content (comments, chat logs) without a separate, hand‑maintained preprocessing step, reducing latency and simplifying deployment.
Voice assistants targeting Vietnamese markets can reliably read out numbers, dates, and foreign brand names, improving user experience.
NLP tasks such as sentiment analysis, named‑entity recognition, or machine translation benefit from a cleaner token stream, leading to higher downstream accuracy.
Serverless or mobile apps can embed the library directly (thanks to its tiny footprint), avoiding costly model downloads and GPU requirements.
Rapid prototyping – Data scientists can plug VietNormalizer into Jupyter notebooks with a single pip install command, accelerating experimentation on Vietnamese corpora.

Limitations & Future Work

The rule‑based system cannot resolve ambiguous contexts (e.g., “03/04/05” could be a date or a version number) without additional linguistic cues.
Domain‑specific slang or newly coined acronyms require manual dictionary updates; the library does not learn new patterns automatically.
Transliteration rules are hand‑crafted and may not capture all phonetic nuances of emerging loanwords.
The authors suggest extending the framework with lightweight statistical disambiguation (e.g., a small CRF model) and exploring cross‑language transfer to other low‑resource tonal languages such as Thai or Burmese.
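
To make the ambiguity concrete, the sketch below lists every calendar-valid reading of a slash-separated token under three common field orders. The helper name, the set of orders, and the choice to map two-digit years into the 2000s are assumptions for this example; the token is assumed to hold three numeric fields.

```python
from datetime import date

# Index of (day, month, year) within the split token for each field order.
ORDERS = {
    "day/month/year": (0, 1, 2),
    "month/day/year": (1, 0, 2),
    "year/month/day": (2, 1, 0),
}


def candidate_dates(token: str, century: int = 2000) -> dict[str, date]:
    """Return every field order under which the token is a valid date."""
    parts = [int(p) for p in token.split("/")]
    if len(parts) != 3:
        return {}
    found = {}
    for label, (day_i, month_i, year_i) in ORDERS.items():
        try:
            found[label] = date(century + parts[year_i], parts[month_i], parts[day_i])
        except ValueError:
            continue  # not a real calendar date under this order
    return found


for label, value in candidate_dates("01/02/03").items():
    print(label, value.isoformat())
```

A token such as “01/02/03” is valid under all three orders, so a rule can only pick one by convention, whereas “31/12/99” survives under a single order. Nothing in the token itself distinguishes a date from a version number, which is where context-aware disambiguation would have to step in.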

VietNormalizer demonstrates that a well‑engineered, dependency‑free rule‑based approach can meet the demanding real‑time needs of modern Vietnamese TTS and NLP applications, offering developers a practical tool that bridges the gap between raw user text and high‑quality language processing.

Authors

Hung Vu Nguyen
Loan Do
Thanh Ngoc Nguyen
Ushik Shrestha Khwakhali
Thanh Pham
Vinh Do
Charlotte Nguyen
Hien Nguyen

Paper Information

arXiv ID: 2603.04145v1
Categories: cs.CL, cs.NE
Published: March 4, 2026
