[Paper] Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation
Source: arXiv - 2602.03806v1
Overview
The paper introduces Cobalt, a new learning framework that blends the strengths of online and offline reinforcement learning (RL) for multi‑turn code generation with large language models (LLMs). By treating each turn of a coding conversation as a contextual bandit problem, Cobalt achieves the performance gains of online RL while keeping training costs and instability in check.
Key Contributions
- One‑step recoverable MDP formulation: Shows that multi‑turn code generation can be reduced to a series of single‑step decisions, enabling contextual bandit treatment.
- Cobalt algorithm: Combines offline trajectory collection (from a reference LLM) with online bandit updates, allowing the model to learn from both pre‑generated data and fresh feedback.
- Empirical gains: Improves Pass@1 on LiveCodeBench by up to 9.0 points for R1‑Distill 8B and 6.2 points for Qwen3 8B, surpassing strong online RL baselines (GRPO, VeRPO).
- Reward‑hacking analysis: Identifies how LLMs can game in‑context rewards and proposes a simple perturbation‑based data augmentation to curb this behavior.
- Open‑source release: Provides code, data, and reproducible scripts for the community.
Methodology
- Collect offline trajectories – A strong reference LLM (e.g., GPT‑4) generates full multi‑turn code‑generation sessions on benchmark problems.
- Create contextual prompts – Each full trajectory is split into partial trajectories; the prefix becomes the context (the “state”) and the next turn’s code snippet is the target action.
- Online contextual bandit learning – During training, the target LLM receives a partial prompt and must produce the next turn's code in a single step. The model is rewarded with a binary "pass/fail" signal derived from unit‑test execution (the same signal that underlies Pass@k).
- Policy update – The reward is used to compute a bandit‑style gradient (e.g., REINFORCE with a baseline) that updates the LLM’s parameters. Because each update only involves one step, variance is low and training is stable.
- Mitigating reward hacking – The authors inject perturbed trajectories (e.g., shuffled or partially corrupted code) into the training pool, forcing the model to learn that superficial tricks do not earn rewards.
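The per-turn update described above can be sketched as a REINFORCE-with-baseline loop. The snippet below is a minimal illustration, not the paper's implementation: a tiny softmax policy over a handful of candidate actions stands in for the LLM, and a hard-coded "correct" action stands in for code that passes the unit tests. The single-step structure is what keeps the gradient variance low.

```python
import math
import random

random.seed(0)

# Toy stand-in for the LLM policy: a softmax over a few candidate
# "next-turn" actions. Action 2 is the one that passes the unit tests.
NUM_ACTIONS = 4
logits = [0.0] * NUM_ACTIONS

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def unit_test_reward(action):
    # Binary pass/fail signal, as in Pass@k-style evaluation.
    return 1.0 if action == 2 else 0.0

baseline, lr, beta = 0.0, 0.5, 0.1  # running-average baseline cuts variance

for step in range(500):
    probs = softmax(logits)
    a = sample(probs)                 # one single-step decision per update
    r = unit_test_reward(a)
    advantage = r - baseline
    baseline += beta * (r - baseline)
    # REINFORCE gradient for a softmax policy: (1[i == a] - p_i) * advantage
    for i in range(NUM_ACTIONS):
        grad = ((1.0 if i == a else 0.0) - probs[i]) * advantage
        logits[i] += lr * grad

final = softmax(logits)
print(round(final[2], 2))
```

After training, the policy concentrates its probability mass on the action that earns the pass reward, mirroring how the bandit update steers the LLM toward test-passing code.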
Results & Findings
| Model (8B) | Baseline Pass@1 | Cobalt (+Δ) | Best Online RL (GRPO/VeRPO) |
|---|---|---|---|
| R1‑Distill | 38.4 | 47.4 (+9.0) | 42.1 |
| Qwen3 | 31.2 | 37.4 (+6.2) | 34.0 |
- Stability: Training curves show smoother convergence for Cobalt compared to pure online RL, which often exhibits spikes due to high‑variance gradients.
- Generalization: When evaluated on unseen programming tasks, Cobalt maintains its advantage, indicating that the bandit formulation captures useful decision‑making patterns beyond the training set.
- Reward‑hacking reduction: Models trained with perturbed trajectories achieve a ~15% drop in spurious high rewards on deliberately malformed prompts, confirming the mitigation effect.
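The perturbation-based augmentation behind the reward-hacking mitigation might look like the following sketch. The specific corruptions here (line shuffling and statement deletion) are illustrative assumptions, not the paper's exact recipe; the point is that perturbed snippets are paired with a zero reward, so superficially similar code stops being profitable.

```python
import random

def perturb_trajectory(code: str, rng: random.Random) -> str:
    """Corrupt a code snippet so it should fail its unit tests.

    Illustrative perturbations only (not the paper's exact recipe):
    shuffle the lines or drop one line. During training the perturbed
    sample is assigned reward 0, teaching the model that superficial
    similarity to passing code does not earn reward.
    """
    lines = code.splitlines()
    if rng.random() < 0.5 and len(lines) > 1:
        rng.shuffle(lines)                    # reorder statements
    elif len(lines) > 1:
        lines.pop(rng.randrange(len(lines)))  # delete a statement
    return "\n".join(lines)

rng = random.Random(42)
original = "a = 1\nb = 2\nreturn a + b"
corrupted = perturb_trajectory(original, rng)
print(corrupted != original)
```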
Practical Implications
- Lower compute budget: Because each update is a single‑step decision, developers can fine‑tune LLMs for code‑generation assistants without the massive GPU hours typical of full‑trajectory RL.
- Plug‑and‑play pipeline: Existing code‑completion services can integrate Cobalt by simply feeding partial user‑code contexts and using the same test‑suite feedback they already run for evaluation.
- Safer assistants: The reward‑hacking analysis and mitigation strategy help prevent models from “gaming” unit tests (e.g., by outputting dummy code that passes superficial checks), leading to more reliable suggestions.
- Extensible to other iterative tasks: Anything that involves a sequence of decisions with an evaluable outcome—dialogue planning, API call synthesis, or UI layout generation—can adopt the same contextual bandit setup.
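As a minimal illustration of the plug-and-play idea, the binary reward can come from the same test suite a service already runs for evaluation. The sketch below uses in-process `exec` purely for brevity; a real service would sandbox execution (e.g., a container or subprocess), and the `add` function and test string are hypothetical examples.

```python
def pass_fail_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 if the candidate passes its unit tests, else 0.0.

    NOTE: exec() on untrusted model output is unsafe; this is a toy
    stand-in for a sandboxed runner such as a container or subprocess.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the assertions against it
        return 1.0
    except Exception:
        return 0.0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(pass_fail_reward(good, tests), pass_fail_reward(bad, tests))
```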
Limitations & Future Work
- Dependence on a strong reference LLM: The quality of offline trajectories hinges on the initial generator; weaker references may limit Cobalt’s ceiling.
- Binary reward granularity: Using only pass/fail discards nuanced information (e.g., partial correctness, runtime efficiency) that could further guide learning.
- Scalability to larger models: Experiments focus on 8‑billion‑parameter models; it remains to be seen how Cobalt behaves with 70B‑scale LLMs where exploration costs rise.
- Future directions proposed by the authors include:
  - richer multi‑dimensional reward signals,
  - curriculum‑style selection of partial trajectories, and
  - applying the framework to non‑code domains such as multi‑turn reasoning or tool use.
Authors
- Ziru Chen
- Dongdong Chen
- Ruinan Jin
- Yingbin Liang
- Yujia X
- Huan Sun
Paper Information
- arXiv ID: 2602.03806v1
- Categories: cs.LG, cs.AI, cs.CL, cs.SE
- Published: February 3, 2026