[Paper] Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Published: February 3, 2026 at 01:08 PM EST
4 min read
Source: arXiv - 2602.03806v1

Overview

The paper introduces Cobalt, a new learning framework that blends the strengths of online and offline reinforcement learning (RL) for multi‑turn code generation with large language models (LLMs). By treating each turn of a coding conversation as a contextual bandit problem, Cobalt achieves the performance gains of online RL while keeping training costs and instability in check.

Key Contributions

  • One‑step recoverable MDP formulation: Shows that multi‑turn code generation can be reduced to a sequence of single‑step decisions, enabling a contextual bandit treatment (formalized in the sketch after this list).
  • Cobalt algorithm: Combines offline trajectory collection (from a reference LLM) with online bandit updates, allowing the model to learn from both pre‑generated data and fresh feedback.
  • Empirical gains: Improves Pass@1 on LiveCodeBench by up to 9.0 points for R1‑Distill 8B and 6.2 points for Qwen3 8B, surpassing strong online RL baselines (GRPO, VeRPO).
  • Reward‑hacking analysis: Identifies how LLMs can game in‑context rewards and proposes a simple perturbation‑based data augmentation to curb this behavior.
  • Open‑source release: Provides code, data, and reproducible scripts for the community.
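
For concreteness, the per‑turn objective that this formulation implies can be written as follows (our notation, not taken verbatim from the paper): with context $c$ (the conversation prefix), action $a$ (the next turn's code), and a binary unit‑test reward $r(c, a) \in \{0, 1\}$, the policy $\pi_\theta$ is trained to maximize

$$\max_{\theta}\; \mathbb{E}_{c \sim \mathcal{D}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid c)}\big[\, r(c, a) \,\big],$$

with no credit assignment across later turns, which is what makes the contextual bandit view applicable.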

Methodology

  1. Collect offline trajectories – A strong reference LLM (e.g., GPT‑4) generates full multi‑turn code‑generation sessions on benchmark problems.
  2. Create contextual prompts – Each full trajectory is split into partial trajectories; the prefix becomes the context (the “state”) and the next turn’s code snippet is the target action.
  3. Online contextual bandit learning – During training, the target LLM receives a partial‑trajectory prompt and must produce the next turn’s code in a single step. The model is rewarded with a binary pass/fail signal derived from unit‑test execution (the same signal that underlies Pass@k).
  4. Policy update – The reward is used to compute a bandit‑style policy gradient (e.g., REINFORCE with a baseline) that updates the LLM’s parameters. Because each update involves only a single step, gradient variance stays low and training remains stable (see the sketch after this list).
  5. Mitigating reward hacking – The authors inject perturbed trajectories (e.g., shuffled or partially corrupted code) into the training pool, forcing the model to learn that superficial tricks do not earn rewards.
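
To make step 4 concrete, here is a minimal sketch of such a single‑step update, assuming a Hugging Face‑style causal LM and a batch of (context, completion, reward) tuples. The function name, the mean‑reward baseline, and the per‑token averaging are our simplifications, not necessarily the paper's exact recipe.

```python
import torch

def bandit_reinforce_step(model, tokenizer, contexts, completions, rewards, optimizer):
    """One contextual-bandit update: REINFORCE with a mean-reward baseline.

    contexts:    list[str]   partial-trajectory prompts (the "state")
    completions: list[str]   sampled next-turn code (the "action")
    rewards:     list[float] binary pass/fail from unit-test execution
    """
    rewards = torch.tensor(rewards, dtype=torch.float32)
    advantages = rewards - rewards.mean()          # simple baseline to reduce variance

    losses = []
    for ctx, completion, adv in zip(contexts, completions, advantages):
        prompt_ids = tokenizer(ctx, return_tensors="pt").input_ids
        full_ids = tokenizer(ctx + completion, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100    # score only the completion tokens

        out = model(input_ids=full_ids, labels=labels)
        # out.loss is the mean negative log-likelihood of the completion tokens,
        # so adv * out.loss is a REINFORCE-style surrogate: minimizing it pushes
        # probability toward above-baseline completions and away from the rest.
        losses.append(adv * out.loss)

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```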

Results & Findings

| Model (8B) | Baseline Pass@1 | Cobalt Pass@1 (+Δ) | Best Online RL (GRPO/VeRPO) |
|------------|-----------------|--------------------|-----------------------------|
| R1‑Distill | 38.4            | 47.4 (+9.0)        | 42.1                        |
| Qwen3      | 31.2            | 37.4 (+6.2)        | 34.0                        |

  • Stability: Training curves show smoother convergence for Cobalt compared to pure online RL, which often exhibits spikes due to high‑variance gradients.
  • Generalization: When evaluated on unseen programming tasks, Cobalt maintains its advantage, indicating that the bandit formulation captures useful decision‑making patterns beyond the training set.
  • Reward‑hacking reduction: Models trained with perturbed trajectories show roughly a 15% drop in spurious high rewards on deliberately malformed prompts, confirming the mitigation effect (a rough sketch of this augmentation follows).
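
The sketch below illustrates what such perturbation‑based augmentation could look like; the specific corruptions (shuffling or dropping lines) and the choice to attach a zero reward are our assumptions, not necessarily the authors' exact procedure.

```python
import random

def perturb_completion(code: str, rng: random.Random) -> str:
    """Corrupt a code completion so it should no longer pass the tests.

    The corruption choices here (shuffling or dropping a line) are illustrative.
    """
    lines = code.splitlines()
    if len(lines) < 2:
        return code + "\nraise NotImplementedError"
    if rng.random() < 0.5:
        rng.shuffle(lines)                        # scramble statement order
    else:
        del lines[rng.randrange(len(lines))]      # drop a random line
    return "\n".join(lines)

def augment_with_perturbations(batch, rng=random.Random(0)):
    """Add (context, perturbed completion, reward=0.0) examples to a training batch,
    so superficially plausible but broken code earns nothing."""
    augmented = list(batch)
    for context, completion, _reward in batch:
        augmented.append((context, perturb_completion(completion, rng), 0.0))
    return augmented
```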

Practical Implications

  • Lower compute budget: Because each update is a single‑step decision, developers can fine‑tune LLMs for code‑generation assistants without the massive GPU hours typical of full‑trajectory RL.
  • Plug‑and‑play pipeline: Existing code‑completion services can integrate Cobalt by feeding partial user‑code contexts and reusing the test‑suite feedback they already run for evaluation (see the sketch after this list).
  • Safer assistants: The reward‑hacking analysis and mitigation strategy help prevent models from “gaming” unit tests (e.g., by outputting dummy code that passes superficial checks), leading to more reliable suggestions.
  • Extensible to other iterative tasks: Anything that involves a sequence of decisions with an evaluable outcome—dialogue planning, API call synthesis, or UI layout generation—can adopt the same contextual bandit setup.
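
As an illustration of the feedback loop described in the second bullet, a service could derive the binary reward by re‑running its existing tests against the model's suggestion. The file names, the pytest invocation, and the temp‑directory sandboxing are placeholder choices, not part of the paper.

```python
import pathlib
import subprocess
import tempfile

def unit_test_reward(candidate_code: str, test_file: str, timeout: int = 30) -> float:
    """Binary reward: 1.0 if the generated code passes the existing tests, else 0.0.

    Assumes the test file imports the code under test from solution.py.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = pathlib.Path(tmp)
        (workdir / "solution.py").write_text(candidate_code)
        (workdir / "test_solution.py").write_text(pathlib.Path(test_file).read_text())
        try:
            result = subprocess.run(
                ["pytest", "-q", "test_solution.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # treat hangs as failures
        return 1.0 if result.returncode == 0 else 0.0
```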

Limitations & Future Work

  • Dependence on a strong reference LLM: The quality of offline trajectories hinges on the initial generator; weaker references may limit Cobalt’s ceiling.
  • Binary reward granularity: Using only pass/fail discards nuanced information (e.g., partial correctness, runtime efficiency) that could further guide learning.
  • Scalability to larger models: Experiments focus on 8‑billion‑parameter models; it remains to be seen how Cobalt behaves with 70B‑scale LLMs where exploration costs rise.
  • Future directions proposed by the authors include:
    1. richer multi‑dimensional reward signals,
    2. curriculum‑style selection of partial trajectories, and
    3. applying the framework to non‑code domains such as multi‑turn reasoning or tool‑use.

Authors

  • Ziru Chen
  • Dongdong Chen
  • Ruinan Jin
  • Yingbin Liang
  • Yujia X
  • Huan Sun

Paper Information

  • arXiv ID: 2602.03806v1
  • Categories: cs.LG, cs.AI, cs.CL, cs.SE
  • Published: February 3, 2026