[Paper] Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

Published: April 22, 2026 at 01:58 PM EDT
4 min read
Source: arXiv


Overview

The paper introduces Parallel‑SFT, a new fine‑tuning recipe that helps large language models (LLMs) trained for code generation transfer their skills across programming languages they have never seen during reinforcement learning (RL). By mixing “parallel programs” – the same algorithm expressed in multiple languages – into the supervised fine‑tuning (SFT) stage, the authors show that subsequent RL on a single source language (e.g., Python) no longer hurts, and often improves, performance on low‑resource target languages such as Rust or Julia.

Key Contributions

  • Zero‑shot cross‑language transfer task for code‑generation RL, highlighting a gap in current RL‑based code models.
  • Empirical finding that naïve RL on a source language can degrade performance on unseen languages, even for strong models like Llama‑3.1.
  • Parallel‑SFT training strategy that injects functionally equivalent code snippets from many languages into the SFT data mix.
  • Demonstrated improvement in downstream RL transfer: models fine‑tuned with Parallel‑SFT retain or boost performance on a suite of unseen target languages.
  • Representation analysis showing a more “functionality‑centric” latent space where equivalent programs across languages cluster tightly.

Methodology

  1. Dataset Construction – The authors collect parallel programs: pairs (or triples) of code that implement the same algorithm in different languages (e.g., a quicksort in Python, C++, and Rust).
  2. Parallel‑SFT – During supervised fine‑tuning, the training batch mixes standard single‑language examples with these parallel examples, encouraging the model to learn language‑agnostic functional patterns.
  3. RL Phase – After SFT, the model undergoes RL (e.g., PPO) on a source language only (the language with abundant reward signals).
  4. Evaluation – Zero‑shot performance is measured on a held‑out set of target languages that never appear in the RL stage. Metrics include pass@k, functional correctness, and code similarity.
  5. Latent‑Space Probing – Embedding vectors of parallel programs are visualized and clustered to assess whether the model groups functionally equivalent code together.

The pipeline is deliberately simple: replace the usual SFT step with Parallel‑SFT, keep the RL algorithm unchanged, and test transferability without any additional target‑language data.
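The data-mixing step can be sketched as follows. This is a minimal illustration, not the paper's released code: the record format, the `parallel_ratio` hyperparameter, and the helper names are all assumptions made for clarity.

```python
import random

# Hypothetical record format: a parallel example stores the same
# algorithm implemented in several languages; a standard SFT example
# carries a single language.
parallel_example = {
    "task": "quicksort",
    "implementations": {
        "python": "def quicksort(xs): ...",
        "cpp": "std::vector<int> quicksort(std::vector<int> xs) { ... }",
        "rust": "fn quicksort(xs: &mut Vec<i32>) { ... }",
    },
}

def build_sft_batch(single_pool, parallel_pool, batch_size, parallel_ratio=0.25):
    """Mix single-language SFT examples with parallel ones.

    `parallel_ratio` controls what fraction of batch slots are filled
    from the parallel pool; it is an assumed knob, not a value from
    the paper.
    """
    n_parallel = int(batch_size * parallel_ratio)
    batch = random.sample(single_pool, batch_size - n_parallel)
    for ex in random.sample(parallel_pool, n_parallel):
        # Emit every language variant of a parallel example together,
        # so the model sees functionally equivalent programs side by side.
        batch.extend(
            {"task": ex["task"], "language": lang, "code": code}
            for lang, code in ex["implementations"].items()
        )
    random.shuffle(batch)
    return batch
```

Everything downstream of this step (the RL phase, the evaluation) is untouched, which is what makes the recipe a drop-in replacement for standard SFT.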

Results & Findings

| Model | Training Regime | Transfer | Pass@1 (Python) | Pass@1 (Rust) | Pass@1 (Julia) |
|---|---|---|---|---|---|
| Llama‑3.1 (base) | – | – | 38% | 12% | 10% |
| Llama‑3.1 | standard SFT + RL (Python) | Degraded | 44% | 8% | 7% |
| Llama‑3.1 | Parallel‑SFT + RL (Python) | Improved | 45% | 15% | 13% |
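The Pass@1 numbers above are the k=1 case of the standard unbiased pass@k estimator (Chen et al., 2021), which the paper's evaluation metrics build on. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the tests."""
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k=1 this reduces to the simple fraction of correct generations, `c / n`.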
  • RL on a single source language can hurt low‑resource languages – a surprising negative transfer effect.
  • Parallel‑SFT restores and surpasses baseline performance on unseen languages – relative to the degraded RL baseline, Rust rises from 8% to 15% and Julia from 7% to 13%.
  • Representation analysis shows that after Parallel‑SFT, embeddings of parallel programs from different languages lie within a tighter cluster (average intra‑cluster distance ↓ 27%), suggesting the model has learned a language‑agnostic functional encoding.
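The intra-cluster distance metric behind the last bullet can be probed with a few lines of code. This is a hypothetical sketch of such an analysis, not the paper's actual probing code: each cluster holds embedding vectors of functionally equivalent programs (one per language), and a tighter cluster means a more functionality-centric latent space.

```python
from itertools import combinations
from math import dist

def mean_intra_cluster_distance(embeddings_by_task):
    """Average pairwise Euclidean distance within each cluster of
    embeddings, where a cluster = all language variants of one task.
    Lower values indicate that equivalent programs map closer together."""
    dists = [
        dist(a, b)
        for cluster in embeddings_by_task.values()
        for a, b in combinations(cluster, 2)
    ]
    return sum(dists) / len(dists)
```

Comparing this quantity before and after Parallel‑SFT is what yields the reported ~27% reduction.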

Practical Implications

  • Multi‑language code assistants – Companies can fine‑tune a single LLM on a modest set of parallel programs and then safely apply RL on the most popular language (e.g., Python) without fearing regression on niche languages used internally.
  • Cost‑effective data collection – Parallel programs can be generated automatically (e.g., via transpilers) or curated from open‑source repositories, reducing the need for massive language‑specific reward datasets.
  • Better debugging & refactoring tools – A functionality‑centric latent space makes it easier to map a bug fix discovered in one language to equivalent changes in another, enabling cross‑language suggestions.
  • Foundation for “code‑agnostic” agents – Parallel‑SFT paves the way for agents that reason about algorithms rather than syntax, potentially improving tasks like algorithm synthesis, educational tutoring, and automated code translation.

Limitations & Future Work

  • Parallel data quality – The approach relies on correctly aligned implementations; noisy or semantically divergent pairs could mislead the model.
  • Scalability to many languages – Experiments cover a handful of target languages; extending to dozens may require smarter sampling or curriculum strategies.
  • RL reward design – The study uses standard pass@k rewards; exploring richer signals (e.g., performance, memory usage) could further test transfer robustness.
  • Long‑range dependencies – The current analysis focuses on single‑function snippets; future work should assess whether Parallel‑SFT helps with larger codebases and multi‑module projects.

Bottom line: Parallel‑SFT offers a pragmatic recipe for developers who want to leverage RL‑enhanced code generation across a diverse language stack without sacrificing performance on low‑resource languages. By grounding the model in language‑agnostic functionality early on, it unlocks more reliable, cross‑language code intelligence.

Authors

  • Zhaofeng Wu
  • Shiqi Wang
  • Boya Peng
  • Anuj Goyal
  • Melanie Kambadur
  • Sebastian Ruder
  • Yoon Kim
  • Chloe Bi

Paper Information

  • arXiv ID: 2604.20835v1
  • Categories: cs.CL
  • Published: April 22, 2026
