[Paper] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting
Source: arXiv - 2512.10780v1
Overview
A new study, Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real-World Setting, shows that the large language models (LLMs) powering clinical chatbots stumble when users type Indian-language text in the Roman alphabet (e.g., "namaste" typed in Latin letters rather than as नमस्ते in Devanagari). On a large-scale maternal-and-newborn health triage dataset, the authors find a 5-12 point drop in F1 score for romanized inputs, a gap they estimate could translate into roughly two million additional mis-triaged cases per year at the partner organization.
Key Contributions
- First real‑world benchmark of LLM‑based health triage across five Indian languages plus Nepali, comparing native‑script and romanized user queries.
- Quantitative evidence of a systematic performance drop (5‑12 F1 points) when the same intent is expressed in Roman script.
- Error‑analysis framework that separates semantic understanding from downstream classification, revealing that models often “get” the intent but still output the wrong triage label.
- Impact estimate: at the partner maternal‑health organization, the script gap could cause roughly 2 million extra triage errors per year.
- Open‑source release of the annotated dataset and evaluation scripts to spur further research on orthographic robustness.
Methodology
- Data collection – The team partnered with a maternal‑health NGO in India to gather ~120 k anonymized triage queries submitted by expectant mothers and caregivers. Each query is labeled with a clinical urgency tier (e.g., “immediate referral”, “routine advice”).
- Script conversion – For every native‑script message, a professional linguist produced a faithful romanized version, preserving spelling conventions used on mobile keyboards.
- Model selection – Prominent LLMs (OpenAI GPT‑4, Anthropic Claude, Google PaLM 2, and a fine‑tuned LLaMA 2) were prompted with a zero‑shot "triage classification" instruction; no additional language‑specific fine‑tuning was applied. A minimal prompting sketch follows this list.
- Evaluation – Standard precision, recall, and F1 scores were computed separately for native‑script and romanized subsets. A secondary “intent‑recovery” test measured whether the model could correctly paraphrase the user’s concern, regardless of the final triage label.
- Impact modeling – Using the organization's historical call volume, the authors projected how the observed F1 gap would affect total triage errors over a year; a rough version of this calculation is also sketched below.
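To make the model-selection step concrete, here is a minimal sketch of how one of the listed backends (GPT-4 via the OpenAI chat API) could be queried with a zero-shot triage instruction. The prompt wording, urgency-tier names, and example queries are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal zero-shot triage prompt against one example backend (OpenAI GPT-4).
# Prompt text, tier names, and the sample queries are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TIERS = ["immediate referral", "routine advice"]

def classify_triage(query: str, model: str = "gpt-4") -> str:
    """Ask the model to map a user query to exactly one urgency tier, zero-shot."""
    prompt = (
        "You are a maternal and newborn health triage assistant. "
        f"Classify the user's message into exactly one of: {', '.join(TIERS)}. "
        "Reply with the tier name only.\n\n"
        f"Message: {query}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# The study compares predictions on the same intent expressed in two scripts:
print(classify_triage("पेट में तेज़ दर्द हो रहा है"))     # native script (Hindi)
print(classify_triage("pet mein tez dard ho raha hai"))  # romanized form
```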
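The per-script evaluation and the annual impact projection can likewise be sketched with standard tooling. The snippet below assumes gold labels and model predictions already split into native-script and romanized subsets; the tiny in-line data and all names are placeholders, and multiplying the F1 gap by annual volume mirrors the back-of-envelope estimate quoted in this summary rather than a formal error model.

```python
# Rough sketch of per-script evaluation and the annual impact projection.
# Data, variable names, and the 25M volume figure are placeholders taken
# from this summary; the paper's released scripts may differ.
from sklearn.metrics import f1_score, precision_score, recall_score

TIERS = ["immediate referral", "routine advice"]  # example urgency tiers

def triage_metrics(gold, pred):
    """Macro-averaged precision/recall/F1 over the triage urgency tiers."""
    kwargs = dict(labels=TIERS, average="macro", zero_division=0)
    return {
        "precision": precision_score(gold, pred, **kwargs),
        "recall": recall_score(gold, pred, **kwargs),
        "f1": f1_score(gold, pred, **kwargs),
    }

# Tiny illustrative subsets; in the study these are ~120k labeled queries
# split by the script the user typed in.
gold_native = ["immediate referral", "routine advice", "routine advice", "immediate referral"]
pred_native = ["immediate referral", "routine advice", "routine advice", "immediate referral"]
gold_roman  = ["immediate referral", "routine advice", "routine advice", "immediate referral"]
pred_roman  = ["routine advice",     "routine advice", "routine advice", "immediate referral"]

native = triage_metrics(gold_native, pred_native)
roman = triage_metrics(gold_roman, pred_roman)
f1_gap = native["f1"] - roman["f1"]

# Back-of-envelope projection, mirroring the estimate quoted in this summary:
# treat the F1 gap as an approximate extra-error rate on the annual volume.
ANNUAL_INTERACTIONS = 25_000_000
extra_errors = f1_gap * ANNUAL_INTERACTIONS
print(f"native F1={native['f1']:.2f}  roman F1={roman['f1']:.2f}  gap={f1_gap:.2f}")
print(f"projected extra mis-triaged cases per year ≈ {extra_errors:,.0f}")
```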
Results & Findings
| Language | Script | Native-script F1 (best LLM) | Δ F1 (Roman vs Native) |
|---|---|---|---|
| Hindi | Devanagari | 0.84 | –0.09 |
| Marathi | Devanagari | 0.81 | –0.07 |
| Tamil | Tamil | 0.78 | –0.12 |
| Telugu | Telugu | 0.80 | –0.08 |
| Bengali | Bengali | 0.83 | –0.05 |
| Nepali | Devanagari | 0.82 | –0.06 |
- Semantic grasp: In >85 % of romanized cases, the model’s internal “thought” (captured via chain‑of‑thought prompting) correctly identified the medical issue.
- Brittle output: The final classification step failed disproportionately when orthographic noise (misspellings, mixed scripts) was present.
- Real‑world cost: Applying the average 8‑point F1 loss to the partner’s ~25 M annual triage interactions yields an estimated ≈2 M additional mis‑classifications, many of which could delay urgent care.
Practical Implications
- Product teams building health‑chatbots for multilingual markets must validate both intent extraction and downstream decision logic on romanized inputs; a “pass” on intent does not guarantee safe outcomes.
- Data pipelines should incorporate script‑normalization (e.g., transliteration to native script) before feeding text to LLMs, or train script‑agnostic adapters that learn robust token embeddings across orthographies; see the transliteration sketch after this list.
- Regulatory compliance: In jurisdictions where clinical decision support is regulated, the script gap could be considered a safety risk, prompting the need for script‑specific performance audits.
- Developer tooling: The released dataset can be used to fine‑tune or evaluate custom classifiers, and it may encourage libraries (e.g., Hugging Face Transformers) to add "romanization‑aware" preprocessing modules.
- Beyond healthcare: Any LLM‑driven customer‑support or finance bot serving Indian users will likely encounter the same orthographic variability, making these findings broadly relevant.
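As one way to implement the script-normalization idea above, the sketch below uses the open-source indic_transliteration package to map clean, ITRANS-style Roman input back to Devanagari before a query reaches the LLM. Real user romanization is far noisier than any fixed scheme, so a learned transliterator would likely be needed in practice; the normalize_query helper is purely illustrative.

```python
# Script-normalization sketch (assumes `pip install indic_transliteration`).
# normalize_query is a hypothetical helper, not part of the paper's release.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def normalize_query(text: str, scheme=sanscript.ITRANS) -> str:
    """Map Roman-script (ITRANS-style) Hindi text to Devanagari.

    Real chat queries use ad-hoc spellings and code-switching, so this
    only handles clean, scheme-conformant input.
    """
    return transliterate(text, scheme, sanscript.DEVANAGARI)

if __name__ == "__main__":
    print(normalize_query("namaste"))  # -> नमस्ते
```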
Limitations & Future Work
- Zero‑shot focus: The study evaluates off‑the‑shelf LLMs without language‑specific fine‑tuning; future work should explore whether targeted fine‑tuning on romanized corpora narrows the gap.
- Script diversity: Only five Indian languages plus Nepali were examined; many regional languages (e.g., Gujarati, Malayalam) remain untested.
- User behavior: Real‑world queries often mix scripts within a single message; the current binary native/roman split does not capture this code‑switching nuance.
- Safety metrics: The impact estimate assumes uniform error cost; a more granular clinical risk assessment (e.g., severity weighting) would refine the real‑world stakes.
Bottom line: The research shines a light on a hidden vulnerability—LLMs can “understand” romanized Indian‑language text but still make unsafe triage decisions. Addressing script robustness is now a concrete, high‑impact priority for anyone deploying LLMs in multilingual, high‑stakes domains.
Authors
- Manurag Khullar
- Utkarsh Desai
- Poorva Malviya
- Aman Dalmia
- Zheyuan Ryan Shi
Paper Information
- arXiv ID: 2512.10780v1
- Categories: cs.CL, cs.LG
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10780v1