[Paper] IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Source: arXiv - 2602.22125v1
Overview
The paper “IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages” fills a glaring gap in the evaluation of large language models (LLMs): most existing instruction‑following tests are English‑only, even though hundreds of millions of people use Indic languages daily. By introducing a rigorously verified, rule‑based benchmark that spans Hindi, Bengali, Tamil, Telugu, and ten other languages, the authors give researchers and developers a concrete way to measure how well LLMs follow structured prompts in these under‑represented languages.
Key Contributions
- Multilingual Benchmark: 14‑language suite (IndicIFEval) with ~800 human‑verified examples per language.
- Two Complementary Subsets:
  - IndicIFEval‑Ground – localized translations of the English IFEval prompts, adapted for cultural relevance.
  - IndicIFEval‑Synth – synthetically created, rule‑driven instructions rooted in native Indic content.
- Automatic Verifiability: Every task includes deterministic, rule‑based checks (e.g., format, lexical constraints) that allow scripts to score model outputs without manual grading.
- Comprehensive Model Survey: Evaluation of both open‑weight (e.g., LLaMA, Mistral) and proprietary (e.g., GPT‑4, Claude) models, covering reasoning‑heavy and purely generative variants.
- Open‑Source Release: Benchmark data, evaluation scripts, and documentation are publicly available on GitHub, encouraging community contributions.
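To make the "automatic verifiability" idea concrete, here is a minimal sketch of what such deterministic, rule-based checks might look like. The function names and rules below are illustrative assumptions, not the benchmark's actual code:

```python
import json

# Hypothetical rule-based validators in the spirit of IFEval-style checks.
# Each returns True/False deterministically, so outputs can be scored
# by a script with no manual grading.

def check_json_array(response: str) -> bool:
    """Pass iff the response parses as a JSON array (a format constraint)."""
    try:
        return isinstance(json.loads(response), list)
    except json.JSONDecodeError:
        return False

def check_bullet_count(response: str, n: int) -> bool:
    """Pass iff the response contains exactly n bullet lines (a layout constraint)."""
    bullets = [ln for ln in response.splitlines() if ln.lstrip().startswith("-")]
    return len(bullets) == n

print(check_json_array('["आम", "केला", "अमरूद"]'))    # True: valid JSON array
print(check_bullet_count("- one\n- two\n- three", 3))  # True: exactly 3 bullets
```

Because the checks are pure functions of the output string, any researcher running the released scripts gets identical scores, which is what makes the benchmark "verifiable".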
Methodology
Prompt Construction
- Grounded Set: Existing English IFEval prompts were translated by native speakers and then “localized” – idioms, cultural references, and domain‑specific terms were swapped for equivalents that make sense in each language.
- Synthetic Set: A rule engine generated instructions (e.g., “List three fruits that start with the letter ‘k’ in Tamil”) using language‑specific lexical resources (word lists, morphological rules).
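The synthetic generation step can be sketched as pairing each generated instruction with the rule that verifies it. The data structures, the romanized toy lexicon, and the function names below are assumptions for illustration; the paper's rule engine uses real language-specific word lists and morphological rules:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A generated instruction bundled with its deterministic validator."""
    instruction: str
    validator: Callable[[list[str]], bool]

def make_starts_with_task(lang: str, letter: str, n: int, lexicon: set[str]) -> Task:
    """Generate a 'list n words starting with X' task from a lexical resource."""
    instruction = f"List {n} words in {lang} that start with '{letter}'."
    def validator(items: list[str]) -> bool:
        return (len(items) == n
                and all(w.startswith(letter) for w in items)
                and all(w in lexicon for w in items))  # must be real lexicon words
    return Task(instruction, validator)

# Toy romanized lexicon standing in for the paper's language-specific word lists.
lexicon = {"kiwi", "karela", "kathal", "mango"}
task = make_starts_with_task("Hindi (romanized)", "k", 3, lexicon)
print(task.validator(["kiwi", "karela", "kathal"]))  # True
print(task.validator(["kiwi", "mango", "kathal"]))   # False: 'mango' breaks the letter rule
```

The key design point is that generation and verification share the same lexical resource, so every synthetic task is checkable by construction.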
Human Verification
- Each translated or synthetic example was reviewed by at least two native annotators to ensure grammaticality, cultural appropriateness, and that the verification rule (e.g., “output must be a JSON array”) is enforceable.
Evaluation Pipeline
- Models receive the instruction and must produce output that satisfies both the semantic request and the formatting constraint (JSON, bullet list, etc.).
- An open‑source script parses the response, checks the format, and then runs a deterministic validator (e.g., regex, lookup table) to confirm correctness.
- Scores are aggregated per language and per task type (lexical, reasoning, cross‑lingual).
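The aggregation step above can be sketched as bucketing pass/fail results by language and task type. The record fields and category labels here are assumptions chosen to mirror the description, not the paper's actual schema:

```python
from collections import defaultdict

# Illustrative per-example results: each record says whether the deterministic
# validator passed for one (language, task_type) instance.
results = [
    {"lang": "hi", "task_type": "lexical", "passed": True},
    {"lang": "hi", "task_type": "lexical", "passed": False},
    {"lang": "ta", "task_type": "reasoning", "passed": True},
]

def aggregate(results: list[dict]) -> dict:
    """Mean pass rate per (language, task_type) bucket."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r["lang"], r["task_type"])].append(r["passed"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

print(aggregate(results))  # {('hi', 'lexical'): 0.5, ('ta', 'reasoning'): 1.0}
```

Averaging boolean pass flags per bucket yields exactly the per-language, per-task-type scores the pipeline reports.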
Model Suite
- Open‑weight: LLaMA‑2 (7B/13B), Mistral‑7B, Falcon‑40B, etc.
- Proprietary: GPT‑4, Claude‑2, Gemini‑Pro.
- Both “reasoning” (chain‑of‑thought enabled) and “non‑reasoning” variants were tested to see how prompting style influences performance.
Results & Findings
| Category | Best Open‑Weight Model | Best Proprietary Model | Observations |
|---|---|---|---|
| Formatting adherence | ~96% (Mistral‑7B) | ~99% (GPT‑4) | Models reliably respect JSON / bullet constraints. |
| Lexical tasks (e.g., list items, spelling) | 45–58% | 70–82% | Significant drop compared to English benchmarks; even top models miss many language‑specific words. |
| Cross‑lingual reasoning (translate‑then‑answer) | 38% | 61% | Reasoning models improve scores but still lag far behind English performance (~90%). |
| Overall Indic‑wide average | 52% | 73% | The gap between high‑resource (Hindi) and low‑resource (Assamese, Konkani) languages is pronounced. |
What it means:
- LLMs are good at obeying structural constraints (they can output valid JSON), but they struggle with the content side when the prompt is in an Indic language.
- Even the most advanced closed‑source models lose 15–30 points compared to their English scores, highlighting a systemic multilingual deficiency.
Practical Implications
- Product Localization: Companies building chatbots, virtual assistants, or documentation generators for Indian markets now have a concrete metric to gauge whether their models will actually follow user instructions in Hindi, Tamil, etc.
- Compliance & Data Extraction: Many enterprise workflows rely on structured outputs (JSON, CSV). IndicIFEval shows that while format compliance is reliable, the semantic correctness of extracted entities (names, dates, product codes) still needs improvement.
- Fine‑Tuning Roadmaps: The benchmark can be used as a validation set for domain‑specific fine‑tuning or instruction‑tuning pipelines, helping teams prioritize language‑specific tokenizers, vocab expansions, or adapter layers.
- Open‑Source Ecosystem: Researchers can benchmark new multilingual LLMs (e.g., BLOOM‑Z, IndicBERT‑LLM) against a shared, verifiable standard, accelerating community‑driven progress.
Limitations & Future Work
- Coverage Bias: Although 14 languages are included, the benchmark leans heavily toward languages with relatively larger digital corpora (Hindi, Bengali). Ultra‑low‑resource languages like Bodo or Manipuri are absent.
- Rule‑Based Validation Ceiling: The deterministic validators capture only a subset of possible correct answers; nuanced semantic variations may be penalized as errors.
- Prompt Diversity: Current tasks focus on constrained generation (lists, JSON). Future versions could add open‑ended reasoning, code generation, or multimodal instructions.
- Model Access: The proprietary-model results rely on black‑box APIs, limiting reproducibility for the broader community.
The authors plan to expand IndicIFEval with more languages, richer task types, and community‑submitted adversarial examples to keep the benchmark both challenging and representative.
Authors
- Thanmay Jayakumar
- Mohammed Safi Ur Rahman Khan
- Raj Dabre
- Ratish Puduppully
- Anoop Kunchukuttan
Paper Information
- arXiv ID: 2602.22125v1
- Categories: cs.CL
- Published: February 25, 2026