[Paper] IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Source: arXiv - 2602.22125v1
Overview
The paper “IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages” fills a glaring gap in the evaluation of large language models (LLMs): most existing instruction‑following tests are English‑only, even though hundreds of millions of people use Indic languages daily. By introducing a rigorously verified, rule‑based benchmark that spans Hindi, Bengali, Tamil, Telugu, and ten other languages, the authors give researchers and developers a concrete way to measure how well LLMs follow structured prompts in these under‑represented languages.
Key Contributions
- Multilingual Benchmark: 14‑language suite (IndicIFEval) with ~800 human‑verified examples per language.
- Two Complementary Subsets:
  - IndicIFEval‑Ground – localized translations of the English IFEval prompts, adapted for cultural relevance.
  - IndicIFEval‑Synth – synthetically created, rule‑driven instructions rooted in native Indic content.
- Automatic Verifiability: Every task includes deterministic, rule‑based checks (e.g., format, lexical constraints) that allow scripts to score model outputs without manual grading.
- Comprehensive Model Survey: Evaluation of both open‑weight (e.g., LLaMA, Mistral) and proprietary (e.g., GPT‑4, Claude) models, covering reasoning‑heavy and purely generative variants.
- Open‑Source Release: Benchmark data, evaluation scripts, and documentation are publicly available on GitHub, encouraging community contributions.
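To make the "automatic verifiability" idea concrete, here is a minimal sketch of what such deterministic, rule-based checks might look like. The function names and rules below are illustrative assumptions, not the benchmark's actual code:

```python
import json

# Hypothetical rule-based validators in the spirit of IFEval-style checks.
# Each returns True/False deterministically, so outputs can be scored
# by a script with no manual grading.

def check_json_array(response: str) -> bool:
    """Pass iff the response parses as a JSON array (a format constraint)."""
    try:
        return isinstance(json.loads(response), list)
    except json.JSONDecodeError:
        return False

def check_bullet_count(response: str, n: int) -> bool:
    """Pass iff the response contains exactly n bullet lines (a layout constraint)."""
    bullets = [ln for ln in response.splitlines() if ln.lstrip().startswith("-")]
    return len(bullets) == n

print(check_json_array('["आम", "केला", "अमरूद"]'))    # True: valid JSON array
print(check_bullet_count("- one\n- two\n- three", 3))  # True: exactly 3 bullets
```

Because the checks are pure functions of the output string, any researcher running the released scripts gets identical scores, which is what makes the benchmark "verifiable".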
Methodology
Prompt Construction
- Grounded Set: Existing English IFEval prompts were translated by native speakers and then “localized” – idioms, cultural references, and domain‑specific terms were swapped for equivalents that make sense in each language.
- Synthetic Set: A rule engine generated instructions (e.g., “List three fruits that start with the letter ‘k’ in Tamil”) using language‑specific lexical resources (word lists, morphological rules).
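The synthetic generation step can be sketched as pairing each generated instruction with the rule that verifies it. The data structures, the romanized toy lexicon, and the function names below are assumptions for illustration; the paper's rule engine uses real language-specific word lists and morphological rules:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A generated instruction bundled with its deterministic validator."""
    instruction: str
    validator: Callable[[list[str]], bool]

def make_starts_with_task(lang: str, letter: str, n: int, lexicon: set[str]) -> Task:
    """Generate a 'list n words starting with X' task from a lexical resource."""
    instruction = f"List {n} words in {lang} that start with '{letter}'."
    def validator(items: list[str]) -> bool:
        return (len(items) == n
                and all(w.startswith(letter) for w in items)
                and all(w in lexicon for w in items))  # must be real lexicon words
    return Task(instruction, validator)

# Toy romanized lexicon standing in for the paper's language-specific word lists.
lexicon = {"kiwi", "karela", "kathal", "mango"}
task = make_starts_with_task("Hindi (romanized)", "k", 3, lexicon)
print(task.validator(["kiwi", "karela", "kathal"]))  # True
print(task.validator(["kiwi", "mango", "kathal"]))   # False: 'mango' breaks the letter rule
```

The key design point is that generation and verification share the same lexical resource, so every synthetic task is checkable by construction.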
Human Verification
- Each translated or synthetic example was reviewed by at least two native annotators to ensure grammaticality, cultural appropriateness, and that the verification rule (e.g., “output must be a JSON array”) is enforceable.
Evaluation Pipeline
- Models receive the instruction and must produce output that satisfies both the semantic request and the formatting constraint (JSON, bullet list, etc.).
- An open‑source script parses the response, checks the format, and then runs a deterministic validator (e.g., regex, lookup table) to confirm correctness.
- Scores are aggregated per language and per task type (lexical, reasoning, cross‑lingual).
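The aggregation step above can be sketched as bucketing pass/fail results by language and task type. The record fields and category labels here are assumptions chosen to mirror the description, not the paper's actual schema:

```python
from collections import defaultdict

# Illustrative per-example results: each record says whether the deterministic
# validator passed for one (language, task_type) instance.
results = [
    {"lang": "hi", "task_type": "lexical", "passed": True},
    {"lang": "hi", "task_type": "lexical", "passed": False},
    {"lang": "ta", "task_type": "reasoning", "passed": True},
]

def aggregate(results: list[dict]) -> dict:
    """Mean pass rate per (language, task_type) bucket."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r["lang"], r["task_type"])].append(r["passed"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

print(aggregate(results))  # {('hi', 'lexical'): 0.5, ('ta', 'reasoning'): 1.0}
```

Averaging boolean pass flags per bucket yields exactly the per-language, per-task-type scores the pipeline reports.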
Model Suite
- Open‑weight: LLaMA‑2 (7B/13B), Mistral‑7B, Falcon‑40B, etc.
- Proprietary: GPT‑4, Claude‑2, Gemini‑Pro.
- Both “reasoning” (chain‑of‑thought enabled) and “non‑reasoning” variants were tested to see how prompting style influences performance.
Results & Findings
| Category | Best Open‑Weight Model | Best Proprietary Model | Observations |
|---|---|---|---|
| Formatting adherence | ~96% (Mistral‑7B) | ~99% (GPT‑4) | Models reliably respect JSON / bullet constraints. |
| Lexical tasks (e.g., list items, spelling) | 45–58% | 70–82% | Significant drop compared to English benchmarks; even top models miss many language‑specific words. |
| Cross‑lingual reasoning (translate‑then‑answer) | 38% | 61% | Reasoning models improve scores but still lag far behind English performance (~90%). |
| Overall Indic‑wide average | 52% | 73% | The gap between high‑resource (Hindi) and low‑resource (Assamese, Konkani) languages is pronounced. |
What it means:
- LLMs are good at obeying structural constraints (they can output valid JSON), but they struggle with the content side when the prompt is in an Indic language.
- Even the most advanced closed‑source models lose 15–30 points compared to their English scores, highlighting a systemic multilingual deficiency.
Practical Implications
- Product Localization: Companies building chatbots, virtual assistants, or documentation generators for Indian markets now have a concrete metric to gauge whether their models will actually follow user instructions in Hindi, Tamil, etc.
- Compliance & Data Extraction: Many enterprise workflows rely on structured outputs (JSON, CSV). IndicIFEval shows that while format compliance is reliable, the semantic correctness of extracted entities (names, dates, product codes) still needs improvement.
- Fine‑Tuning Roadmaps: The benchmark can be used as a validation set for domain‑specific fine‑tuning or instruction‑tuning pipelines, helping teams prioritize language‑specific tokenizers, vocab expansions, or adapter layers.
- Open‑Source Ecosystem: Researchers can benchmark new multilingual LLMs (e.g., BLOOM‑Z, IndicBERT‑LLM) against a shared, verifiable standard, accelerating community‑driven progress.
Limitations & Future Work
- Coverage Bias: Although 14 languages are included, the benchmark leans heavily toward languages with relatively larger digital corpora (Hindi, Bengali). Ultra‑low‑resource languages like Bodo or Manipuri are absent.
- Rule‑Based Validation Ceiling: The deterministic validators capture only a subset of possible correct answers; nuanced semantic variations may be penalized as errors.
- Prompt Diversity: Current tasks focus on constrained generation (lists, JSON). Future versions could add open‑ended reasoning, code generation, or multimodal instructions.
- Model Access: The proprietary-model results rely on black‑box APIs, limiting reproducibility for the broader community.
The authors plan to expand IndicIFEval with more languages, richer task types, and community‑submitted adversarial examples to keep the benchmark both challenging and representative.
Authors
- Thanmay Jayakumar
- Mohammed Safi Ur Rahman Khan
- Raj Dabre
- Ratish Puduppully
- Anoop Kunchukuttan
Paper Information
- arXiv ID: 2602.22125v1
- Categories: cs.CL
- Published: February 25, 2026