[Paper] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting
Source: arXiv - 2512.10780v1
Overview
A new study, Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real-World Setting, shows that the large language models (LLMs) powering clinical chatbots stumble when users type Indian-language text in the Roman alphabet (e.g., "namaste" typed in Latin letters rather than as नमस्ते in Devanagari). On a large-scale maternal-and-newborn health triage dataset, the authors find a 5-12 point drop in F1 score for romanized inputs, a gap they estimate could translate into roughly two million additional mis-triaged cases per year at the partner organization.
Key Contributions
- First real‑world benchmark of LLM‑based health triage across five Indian languages plus Nepali, comparing native‑script and romanized user queries.
- Quantitative evidence of a systematic performance drop (5‑12 F1 points) when the same intent is expressed in Roman script.
- Error‑analysis framework that separates semantic understanding from downstream classification, revealing that models often “get” the intent but still output the wrong triage label.
- Impact estimate: at the partner maternal‑health organization, the script gap could cause roughly 2 million extra triage errors per year.
- Open‑source release of the annotated dataset and evaluation scripts to spur further research on orthographic robustness.
Methodology
- Data collection – The team partnered with a maternal‑health NGO in India to gather ~120 k anonymized triage queries submitted by expectant mothers and caregivers. Each query is labeled with a clinical urgency tier (e.g., “immediate referral”, “routine advice”).
- Script conversion – For every native‑script message, a professional linguist produced a faithful romanized version, preserving spelling conventions used on mobile keyboards.
- Model selection – Prominent LLMs (OpenAI GPT‑4, Anthropic Claude, Google PaLM 2, and a fine‑tuned LLaMA 2) were prompted with a zero‑shot "triage classification" instruction; no additional language‑specific fine‑tuning was applied. A minimal prompting sketch follows this list.
- Evaluation – Standard precision, recall, and F1 scores were computed separately for native‑script and romanized subsets. A secondary “intent‑recovery” test measured whether the model could correctly paraphrase the user’s concern, regardless of the final triage label.
- Impact modeling – Using the organization's historical call volume, the authors projected how the observed F1 gap would affect total triage errors over a year; a rough version of this calculation is also sketched below.
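To make the model-selection step concrete, here is a minimal sketch of how one of the listed backends (GPT-4 via the OpenAI chat API) could be queried with a zero-shot triage instruction. The prompt wording, urgency-tier names, and example queries are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal zero-shot triage prompt against one example backend (OpenAI GPT-4).
# Prompt text, tier names, and the sample queries are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TIERS = ["immediate referral", "routine advice"]

def classify_triage(query: str, model: str = "gpt-4") -> str:
    """Ask the model to map a user query to exactly one urgency tier, zero-shot."""
    prompt = (
        "You are a maternal and newborn health triage assistant. "
        f"Classify the user's message into exactly one of: {', '.join(TIERS)}. "
        "Reply with the tier name only.\n\n"
        f"Message: {query}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# The study compares predictions on the same intent expressed in two scripts:
print(classify_triage("पेट में तेज़ दर्द हो रहा है"))     # native script (Hindi)
print(classify_triage("pet mein tez dard ho raha hai"))  # romanized form
```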
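The per-script evaluation and the annual impact projection can likewise be sketched with standard tooling. The snippet below assumes gold labels and model predictions already split into native-script and romanized subsets; the tiny in-line data and all names are placeholders, and multiplying the F1 gap by annual volume mirrors the back-of-envelope estimate quoted in this summary rather than a formal error model.

```python
# Rough sketch of per-script evaluation and the annual impact projection.
# Data, variable names, and the 25M volume figure are placeholders taken
# from this summary; the paper's released scripts may differ.
from sklearn.metrics import f1_score, precision_score, recall_score

TIERS = ["immediate referral", "routine advice"]  # example urgency tiers

def triage_metrics(gold, pred):
    """Macro-averaged precision/recall/F1 over the triage urgency tiers."""
    kwargs = dict(labels=TIERS, average="macro", zero_division=0)
    return {
        "precision": precision_score(gold, pred, **kwargs),
        "recall": recall_score(gold, pred, **kwargs),
        "f1": f1_score(gold, pred, **kwargs),
    }

# Tiny illustrative subsets; in the study these are ~120k labeled queries
# split by the script the user typed in.
gold_native = ["immediate referral", "routine advice", "routine advice", "immediate referral"]
pred_native = ["immediate referral", "routine advice", "routine advice", "immediate referral"]
gold_roman  = ["immediate referral", "routine advice", "routine advice", "immediate referral"]
pred_roman  = ["routine advice",     "routine advice", "routine advice", "immediate referral"]

native = triage_metrics(gold_native, pred_native)
roman = triage_metrics(gold_roman, pred_roman)
f1_gap = native["f1"] - roman["f1"]

# Back-of-envelope projection, mirroring the estimate quoted in this summary:
# treat the F1 gap as an approximate extra-error rate on the annual volume.
ANNUAL_INTERACTIONS = 25_000_000
extra_errors = f1_gap * ANNUAL_INTERACTIONS
print(f"native F1={native['f1']:.2f}  roman F1={roman['f1']:.2f}  gap={f1_gap:.2f}")
print(f"projected extra mis-triaged cases per year ≈ {extra_errors:,.0f}")
```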
Results & Findings
| Language | Script | Native-script F1 (best LLM) | Δ F1 (Roman vs Native) |
|---|---|---|---|
| Hindi | Devanagari | 0.84 | –0.09 |
| Marathi | Devanagari | 0.81 | –0.07 |
| Tamil | Tamil | 0.78 | –0.12 |
| Telugu | Telugu | 0.80 | –0.08 |
| Bengali | Bengali | 0.83 | –0.05 |
| Nepali | Devanagari | 0.82 | –0.06 |
- Semantic grasp: In >85 % of romanized cases, the model’s internal “thought” (captured via chain‑of‑thought prompting) correctly identified the medical issue.
- Brittle output: The final classification step failed disproportionately when orthographic noise (misspellings, mixed scripts) was present.
- Real‑world cost: Applying the average 8‑point F1 loss to the partner’s ~25 M annual triage interactions yields an estimated ≈2 M additional mis‑classifications, many of which could delay urgent care.
Practical Implications
- Product teams building health‑chatbots for multilingual markets must validate both intent extraction and downstream decision logic on romanized inputs; a “pass” on intent does not guarantee safe outcomes.
- Data pipelines should incorporate script‑normalization (e.g., transliteration to native script) before feeding text to LLMs, or train script‑agnostic adapters that learn robust token embeddings across orthographies; see the transliteration sketch after this list.
- Regulatory compliance: In jurisdictions where clinical decision support is regulated, the script gap could be considered a safety risk, prompting the need for script‑specific performance audits.
- Developer tooling: The released dataset can be used to fine‑tune or evaluate custom classifiers, and it may encourage libraries (e.g., Hugging Face Transformers) to add "romanization‑aware" preprocessing modules.
- Beyond healthcare: Any LLM‑driven customer‑support or finance bot serving Indian users will likely encounter the same orthographic variability, making these findings broadly relevant.
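As one way to implement the script-normalization idea above, the sketch below uses the open-source indic_transliteration package to map clean, ITRANS-style Roman input back to Devanagari before a query reaches the LLM. Real user romanization is far noisier than any fixed scheme, so a learned transliterator would likely be needed in practice; the normalize_query helper is purely illustrative.

```python
# Script-normalization sketch (assumes `pip install indic_transliteration`).
# normalize_query is a hypothetical helper, not part of the paper's release.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def normalize_query(text: str, scheme=sanscript.ITRANS) -> str:
    """Map Roman-script (ITRANS-style) Hindi text to Devanagari.

    Real chat queries use ad-hoc spellings and code-switching, so this
    only handles clean, scheme-conformant input.
    """
    return transliterate(text, scheme, sanscript.DEVANAGARI)

if __name__ == "__main__":
    print(normalize_query("namaste"))  # -> नमस्ते
```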
Limitations & Future Work
- Zero‑shot focus: The study evaluates off‑the‑shelf LLMs without language‑specific fine‑tuning; future work should explore whether targeted fine‑tuning on romanized corpora narrows the gap.
- Script diversity: Only five Indian languages plus Nepali were examined; many regional languages (e.g., Gujarati, Malayalam) remain untested.
- User behavior: Real‑world queries often mix scripts within a single message; the current binary native/roman split does not capture this code‑switching nuance.
- Safety metrics: The impact estimate assumes uniform error cost; a more granular clinical risk assessment (e.g., severity weighting) would refine the real‑world stakes.
Bottom line: The research shines a light on a hidden vulnerability—LLMs can “understand” romanized Indian‑language text but still make unsafe triage decisions. Addressing script robustness is now a concrete, high‑impact priority for anyone deploying LLMs in multilingual, high‑stakes domains.
Authors
- Manurag Khullar
- Utkarsh Desai
- Poorva Malviya
- Aman Dalmia
- Zheyuan Ryan Shi
Paper Information
- arXiv ID: 2512.10780v1
- Categories: cs.CL, cs.LG
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10780v1