[Paper] Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training

Published: February 5, 2026 at 12:55 PM EST
4 min read
Source: arXiv - 2602.05940v1

Overview

The paper introduces TRIT (Translation‑Reasoning Integrated Training), a self‑improving framework that teaches large language models to translate and reason in tandem. By weaving translation directly into the reasoning pipeline, TRIT closes a long‑standing gap: multilingual models tend to default to English reasoning, or suffer steep accuracy drops when forced to stay in the question’s native language. The result is a single model that can understand multilingual math problems and produce answers that are both correct and linguistically consistent.

Key Contributions

  • Unified training of translation and reasoning – eliminates the need for separate translation modules or external multilingual data.
  • Self‑improving loop – the model generates its own translation‑reasoning pairs, continuously refining both abilities.
  • Significant performance boost – on the multilingual math benchmark MMATH, TRIT gains ~7 percentage points of absolute accuracy over strong baselines.
  • Cross‑lingual alignment improvement – question‑language understanding rises by more than 10 percentage points, reducing the “English‑only” bias.
  • Better translation quality – achieves up to +8.4 COMET points on FLORES‑200, showing that reasoning training also sharpens pure translation.

Methodology

  1. Data Construction – Starting from existing multilingual question‑answer pairs (e.g., MMATH), the authors generate synthetic translation‑reasoning triples. Each triple contains:

    • The original question in language L.
    • A machine‑generated translation of the question into English.
    • A step‑by‑step reasoning trace (in English) that leads to the answer.
  2. Integrated Training Objective – The model is trained with a single loss that simultaneously rewards:

    • Accurate translation of the question into English.
    • Correct reasoning steps that follow the translated question.
    • Proper generation of the final answer in the original language.
  3. Self‑Improvement Cycle – After an initial training pass, the model is used to re‑translate and re‑reason on the same data, producing higher‑quality triples. These refreshed triples replace the older ones, and the model is fine‑tuned again. The loop repeats a few times, each iteration “teaching” the model to be better at both tasks without any human‑in‑the‑loop annotation.

  4. Evaluation – Performance is measured on:

    • Answer correctness (exact match on math problems).
    • Language consistency (whether the answer is expressed in the same language as the question).
    • Translation quality (COMET scores on FLORES‑200).
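The training recipe above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `retranslate` and `rereason` stand in for the model's own generation calls, the loss weights are hypothetical, and the three NLL terms would come from token-level likelihoods in practice.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triple:
    question: str     # original question in language L
    translation: str  # model-generated English translation of the question
    reasoning: str    # English step-by-step reasoning trace ending in the answer

def joint_loss(nll_translation: float, nll_reasoning: float, nll_answer: float,
               w_t: float = 1.0, w_r: float = 1.0, w_a: float = 1.0) -> float:
    """Single objective rewarding accurate translation, correct reasoning,
    and answering in the original language (weights are hypothetical)."""
    return w_t * nll_translation + w_r * nll_reasoning + w_a * nll_answer

def self_improve(triples: List[Triple],
                 retranslate: Callable[[str], str],
                 rereason: Callable[[str], str],
                 iterations: int = 3) -> List[Triple]:
    """Self-improvement cycle: each round, the current model regenerates
    translations and reasoning traces, and the refreshed triples replace
    the old ones before the next fine-tuning pass."""
    for _ in range(iterations):
        triples = [Triple(t.question,
                          retranslate(t.question),
                          rereason(t.question))
                   for t in triples]
        # (fine-tune the model on `triples` with joint_loss here)
    return triples
```

The key design point is that one model and one loss cover all three sub-tasks, so no separate translation module or external parallel corpus is required.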

Results & Findings

| Metric | Baseline (multilingual LLM) | TRIT (final iteration) |
| --- | --- | --- |
| MMATH overall accuracy | ~58 % | ~65 % (+7 pts) |
| Cross‑lingual question alignment | ~68 % | ~78 % (+10 pts) |
| FLORES‑200 COMET (translation) | 71.2 | 79.6 (+8.4) |
| Language‑consistent answer rate | 62 % | 71 % |

What this means:

  • Reasoning quality improves because the model no longer has to “guess” the English meaning of a foreign question—it sees a clean translation it helped produce.
  • Language consistency rises, so developers can trust that the model will answer in the same language the user asked, a crucial feature for multilingual chatbots or tutoring apps.
  • Translation gains demonstrate a pleasant side‑effect: training on reasoning also sharpens the model’s pure translation ability, suggesting a shared representation between the two tasks.
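Two of the evaluation axes above (answer correctness and language consistency) reduce to simple metric functions. A minimal sketch, with the caveat that `detect_lang` is a hypothetical language-ID helper and the paper's COMET-based translation scoring is omitted:

```python
from typing import Callable, List

def exact_match(prediction: str, gold: str) -> bool:
    """Answer correctness: exact match after trivial whitespace normalization."""
    return prediction.strip() == gold.strip()

def language_consistency_rate(questions: List[str], answers: List[str],
                              detect_lang: Callable[[str], str]) -> float:
    """Fraction of answers expressed in the same language as the question."""
    same = sum(detect_lang(a) == detect_lang(q)
               for q, a in zip(questions, answers))
    return same / len(questions)
```

A real evaluation would plug in an off-the-shelf language identifier for `detect_lang`; the metric itself is just a per-example agreement rate.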

Practical Implications

  • Multilingual AI assistants can now handle complex, multi‑step queries (e.g., math, logic puzzles) without falling back to English, delivering a smoother user experience across markets.
  • Educational tech platforms that support dozens of languages can rely on a single model for both problem translation and solution generation, cutting infrastructure and maintenance costs.
  • Cross‑border data pipelines (e.g., extracting insights from multilingual reports) can embed TRIT‑style training to keep the semantic meaning intact while performing downstream reasoning.
  • Developer workflow – the self‑improving loop requires only the original multilingual QA data; no extra translation corpora or human annotation is needed, making it easy to adopt on existing datasets.

Limitations & Future Work

  • Domain scope – Experiments focus on mathematical reasoning; it remains to be seen how well TRIT transfers to other domains such as legal reasoning or code synthesis.
  • Resource demand – The iterative self‑training loop adds extra compute cycles compared to a one‑shot fine‑tune, which may be a barrier for smaller teams.
  • Language coverage – While FLORES‑200 includes 200 languages, the benchmark used (MMATH) only spans a subset; low‑resource languages with scarce training data might still lag behind.
  • Future directions suggested by the authors include: extending TRIT to multimodal inputs (e.g., diagrams), integrating external knowledge bases to further boost reasoning depth, and exploring curriculum‑learning schedules that prioritize harder languages earlier in the loop.

Authors

  • Junxiao Liu
  • Zhijun Wang
  • Yixiao Li
  • Zhejian Lai
  • Liqian Huang
  • Xin Huang
  • Xue Han
  • Junlan Feng
  • Shujian Huang

Paper Information

  • arXiv ID: 2602.05940v1
  • Categories: cs.CL
  • Published: February 5, 2026