[Paper] Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation
Source: arXiv - 2602.04764v1
Overview
The paper investigates whether feeding a large language model (LLM) thousands of translation examples at inference time—rather than the usual handful—can meaningfully improve machine translation for low‑resource languages. By stretching the model’s context window to as much as 1 million tokens, the authors examine how different kinds of in‑context supervision (monolingual, instruction‑style, and parallel data) affect translation quality for Javanese and Sundanese.
Key Contributions
- Scale‑up of in‑context learning (ICL): Demonstrates the feasibility of using up to 1 M tokens of demonstrations for MT, far beyond the typical few‑shot setting.
- Systematic comparison of supervision types: Evaluates monolingual unsupervised data, instruction‑style prompts, and true parallel corpora as sources of in‑context examples.
- Empirical saturation analysis: Shows that translation quality improves quickly with a few hundred examples but plateaus—or even degrades—when the context window is filled.
- Competitive monolingual supervision: Finds that certain monolingual prompt designs can rival parallel data despite lacking direct source‑target pairs.
- Guidelines for long‑context MT: Provides practical recommendations on how much context to use and which data type to prioritize for low‑resource scenarios.
Methodology
- Model & Context Window: The authors use a “long‑context” LLM capable of handling up to 1 M tokens (e.g., a transformer with extended attention).
- Demo Construction: Three corpora are prepared:
  - Monolingual: Sentences in the target language paired with generic prompts (e.g., “Translate this to X”).
  - Instruction‑style: Human‑written prompts that describe the translation task in natural language.
  - Parallel: Classic bilingual sentence pairs (English‑target and Indonesian‑target).
- Scaling Procedure: For each corpus type, they incrementally increase the number of in‑context examples (e.g., 8, 32, 128, 512, … up to the token limit).
- Evaluation: Translation quality is measured on held‑out Javanese and Sundanese test sets using BLEU and chrF scores, with statistical significance testing to detect saturation points.
- Analysis: They track performance trends, compute per‑token efficiency, and inspect failure cases when the context window is near capacity.
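The scaling procedure can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the geometric example schedule, and the whitespace "tokenizer" are all stand-in assumptions (a real setup would use the model's own tokenizer and prompt format).

```python
# Sketch of the incremental scaling procedure: grow the number of
# in-context demonstrations geometrically until the assembled prompt
# would exceed the model's token limit. The whitespace token count and
# the prompt template are illustrative stand-ins.

def count_tokens(text: str) -> int:
    """Crude token estimate; a real setup would use the model's tokenizer."""
    return len(text.split())

def build_prompt(pairs, source_sentence, target_lang="Javanese"):
    """Format demonstration pairs as translation examples, then the query."""
    demos = "\n".join(f"English: {src}\n{target_lang}: {tgt}" for src, tgt in pairs)
    return f"{demos}\nEnglish: {source_sentence}\n{target_lang}:"

def scaling_schedule(pairs, source_sentence, token_limit=1_000_000):
    """Yield (n_examples, prompt) for n = 8, 32, 128, ... within the limit."""
    n = 8
    while n <= len(pairs):
        prompt = build_prompt(pairs[:n], source_sentence)
        if count_tokens(prompt) > token_limit:
            break
        yield n, prompt
        n *= 4

# Toy demonstration pool (hypothetical data, not the paper's corpora).
pool = [(f"sentence {i}", f"ukara {i}") for i in range(2048)]
for n, prompt in scaling_schedule(pool, "Good morning", token_limit=5000):
    print(n, count_tokens(prompt))
```

Each yielded prompt would then be sent to the long‑context model and scored (e.g., with BLEU/chrF) to trace the saturation curve.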
Results & Findings
| Corpus Type | Best BLEU (Javanese) | Best BLEU (Sundanese) | Saturation Point |
|---|---|---|---|
| Monolingual (unsupervised) | 23.4 | 21.9 | ~256 examples (≈ 8 k tokens) |
| Instruction‑style | 24.1 | 22.5 | ~512 examples (≈ 16 k tokens) |
| Parallel (Eng‑Target) | 25.3 | 23.8 | ~1 k examples (≈ 32 k tokens) |
| Parallel (Ind‑Target) | 25.7 | 24.2 | ~1 k examples (≈ 32 k tokens) |
- Rapid early gains: Adding the first few hundred demonstrations yields the bulk of the improvement (≈ 80 % of the total gain).
- Diminishing returns: Beyond ~1 k examples, BLEU scores plateau and even dip when the context window exceeds ~200 k tokens, likely due to attention dilution and prompt overload.
- Monolingual vs. Parallel: Certain monolingual prompt formats (e.g., “Write a fluent sentence about …”) achieve BLEU within 1–2 points of parallel data, showing that direct bilingual supervision isn’t strictly necessary for modest gains.
- Corpus‑type sensitivity: Instruction‑style prompts are more robust to larger context sizes, while raw parallel examples suffer earlier degradation.
Practical Implications
- Low‑resource deployment: Developers can boost translation quality for niche languages by simply appending a few hundred well‑crafted examples to each inference request—no fine‑tuning required.
- Prompt engineering over data collection: Investing effort in high‑quality monolingual or instruction prompts may be cheaper and faster than gathering parallel corpora for every new language pair.
- Context window budgeting: When using LLM APIs with token limits (e.g., OpenAI’s 128 k token cap), aim for ≤ 30 k tokens of demonstrations to stay on the “sweet spot” of performance.
- Hybrid pipelines: Combine a small set of parallel examples (for anchor quality) with larger monolingual or instruction blocks to maximize ROI on token usage.
- Edge‑device inference: For on‑device models with limited memory, the findings suggest that a modest demo buffer (a few hundred sentences) is sufficient, keeping latency manageable.
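A hybrid budgeting strategy along these lines could be sketched as below. The ~30 k‑token cap echoes the recommendation above, but the packing order, anchor count, and helper names are illustrative assumptions, not a prescription from the paper.

```python
# Sketch: pack a small block of parallel "anchor" examples first, then fill
# the remaining budget with monolingual examples, staying within roughly
# 30k demonstration tokens. Token counting is a crude whitespace estimate;
# swap in the target model's tokenizer in practice.

def estimate_tokens(text: str) -> int:
    return len(text.split())

def pack_demonstrations(parallel, monolingual, budget=30_000, n_anchors=64):
    """Return (demo_string, tokens_used): up to n_anchors parallel pairs,
    then monolingual examples until the token budget is exhausted."""
    chosen, used = [], 0
    for src, tgt in parallel[:n_anchors]:
        demo = f"English: {src}\nTarget: {tgt}"
        cost = estimate_tokens(demo)
        if used + cost > budget:
            break
        chosen.append(demo)
        used += cost
    for sent in monolingual:
        demo = f"Example target-language sentence: {sent}"
        cost = estimate_tokens(demo)
        if used + cost > budget:
            break
        chosen.append(demo)
        used += cost
    return "\n\n".join(chosen), used

# Toy usage with hypothetical data:
parallel = [("hello", "halo")] * 100
mono = ["kalimat contoh"] * 1000
demos, used = pack_demonstrations(parallel, mono, budget=500)
```

The resulting block would be prepended to each inference request, trading a fixed token cost for the bulk of the quality gain observed in the saturation analysis.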
Limitations & Future Work
- Language scope: Experiments are limited to Javanese and Sundanese; results may differ for languages with more divergent scripts or morphology.
- Model size dependency: The study uses a single long‑context LLM; scaling behavior could change with larger or smaller models.
- Prompt diversity: Only three broad corpus categories were examined; richer prompt variations (e.g., code‑mixed, domain‑specific) remain unexplored.
- Evaluation metrics: BLEU/chrF capture surface similarity but may miss nuanced adequacy or fluency improvements; human evaluation would strengthen conclusions.
- Future directions: Investigate adaptive demo selection (retrieving the most relevant examples per source sentence), explore multi‑turn interactive ICL, and test the approach on truly zero‑resource languages where no parallel data exists at all.
Authors
- Luis Frentzen Salim
- Esteban Carlin
- Alexandre Morinvil
- Xi Ai
- Lun‑Wei Ku
Paper Information
- arXiv ID: 2602.04764v1
- Categories: cs.CL
- Published: February 4, 2026