[Paper] Bellman Calibration for V-Learning in Offline Reinforcement Learning

Published: December 29, 2025 at 01:52 PM EST
4 min read
Source: arXiv - 2512.23694v1

Overview

The paper presents Iterated Bellman Calibration, a lightweight, model‑agnostic post‑processing step that sharpens off‑policy value estimates in offline reinforcement learning (RL). By repeatedly aligning predicted long‑term returns with one‑step Bellman consistency, the method improves the reliability of value functions without demanding strong assumptions such as Bellman completeness.

Key Contributions

  • Iterated Bellman Calibration (IBC): a simple, plug‑in procedure that can be applied to any existing value estimator (e.g., fitted Q‑iteration, neural‑network critics).
  • Doubly robust pseudo-outcome: leverages importance weighting and a learned model of the dynamics to construct unbiased one-step Bellman targets from offline data; a minimal code sketch follows this list.
  • Histogram & isotonic calibration extensions: adapts classic calibration tools to the sequential, counterfactual RL setting, yielding a one‑dimensional fitted‑value‑iteration loop.
  • Finite‑sample guarantees: provides theoretical bounds on both calibration error and final value‑prediction error under weak, realistic conditions (no Bellman completeness or realizability required).
  • Model‑agnostic applicability: works with tabular, linear, and deep‑network value functions alike, making it a practical add‑on for existing pipelines.
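
To make the doubly robust pseudo-outcome concrete, here is a minimal NumPy sketch of how such targets could be computed from logged transitions. The function name, array layout, and the clipping of behavior probabilities are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dr_pseudo_outcomes(r, s, a, s_next, v_hat, q_hat,
                       pi_prob, mu_prob, gamma=0.99):
    """Doubly robust one-step targets (hypothetical helper, not the paper's code).

    r, a              : arrays of shape (n,) with rewards and logged actions
    s, s_next         : arrays of shape (n, d) with states and next states
    v_hat(states)     : current value estimate, returns shape (n,)
    q_hat(states, a)  : learned Q-function, returns shape (n,)
    pi_prob, mu_prob  : target / behavior probabilities of the logged actions
    """
    td_target = r + gamma * v_hat(s_next)           # r + gamma * V_hat(s')
    rho = pi_prob / np.clip(mu_prob, 1e-8, None)    # importance ratio pi(a|s) / mu(a|s)
    residual = td_target - q_hat(s, a)              # one-step Bellman residual
    return td_target + rho * residual               # doubly robust pseudo-outcome
```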

Methodology

  1. Start with any off-policy value estimator $\hat V$ trained on a static dataset of trajectories collected under a behavior policy.
  2. Compute a doubly robust pseudo-outcome for each state:
    $$\tilde Y = r + \gamma \hat V(s') + \frac{\pi(a \mid s)}{\mu(a \mid s)}\bigl(r + \gamma \hat V(s') - \hat Q(s,a)\bigr)$$
    where $\pi$ is the target policy, $\mu$ the behavior policy, and $\hat Q$ a learned Q-function. The importance-weighted correction accounts for distribution shift, while the model-based term keeps variance low.
  3. Calibrate: treat $\hat V(s)$ as a “score” and regress the pseudo-outcomes $\tilde Y$ onto these scores using either histogram binning or isotonic regression. The fitted regression function $g$ maps raw predictions to calibrated values $\hat V_{\text{cal}}(s) = g(\hat V(s))$.
  4. Iterate: replace $\hat V$ with $\hat V_{\text{cal}}$ and repeat steps 2-3 a few times (typically 3-5 iterations). Each pass enforces Bellman consistency on a finer scale, akin to a one-dimensional form of fitted value iteration.
  5. Output the final calibrated value function, which can be used for policy evaluation or improvement.

The whole pipeline is post‑hoc: you train your usual offline RL model, then run IBC as a separate calibration stage—no retraining of the underlying representation is needed.
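
Putting the steps together, the calibration loop itself can be sketched in a few lines, reusing the `dr_pseudo_outcomes` helper from the earlier sketch and scikit-learn's isotonic regression as the one-dimensional calibrator $g$. The function names, the fixed $\hat Q$ across iterations, and the choice of isotonic regression over histogram binning are assumptions made for illustration, not the paper's reference implementation.

```python
from sklearn.isotonic import IsotonicRegression

def _compose(prev_v, g):
    """Calibrated estimate: V_cal(s) = g(V_prev(s))."""
    return lambda states: g.predict(prev_v(states))

def iterated_bellman_calibration(v_hat, q_hat, batch, pi_prob, mu_prob,
                                 gamma=0.99, n_iters=3):
    """Post-hoc IBC loop (illustrative sketch, not the authors' reference code).

    batch : dict with arrays 'r', 's', 'a', 's_next' from the offline dataset
    v_hat : initial value estimate, callable on a batch of states
    q_hat : learned Q-function, callable on (states, actions)
    """
    v_cur = v_hat
    for _ in range(n_iters):
        # Step 2: doubly robust pseudo-outcomes under the current value estimate.
        y = dr_pseudo_outcomes(batch["r"], batch["s"], batch["a"], batch["s_next"],
                               v_cur, q_hat, pi_prob, mu_prob, gamma)
        # Step 3: regress pseudo-outcomes on the one-dimensional scores V_cur(s);
        # isotonic regression plays the role of the calibrator g.
        scores = v_cur(batch["s"])
        g = IsotonicRegression(out_of_bounds="clip").fit(scores, y)
        # Step 4: compose g with the previous estimate and repeat.
        v_cur = _compose(v_cur, g)
    return v_cur  # calibrated value function V_cal
```

The returned `v_cur` can then be evaluated on any batch of states for off-policy evaluation; the underlying critic is never retrained, matching the post-hoc nature of the procedure.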

Results & Findings

  • Synthetic MDP experiments (tabular and continuous) show that IBC reduces mean‑squared error of value estimates by 30‑50 % compared with the raw estimator, even when the base model is severely misspecified.
  • Deep offline RL benchmarks (e.g., D4RL locomotion and Atari) demonstrate consistent gains in policy evaluation accuracy and modest improvements in policy performance after a single policy‑improvement step using the calibrated values.
  • Theoretical analysis proves that after $K$ calibration iterations, the calibration error shrinks at a rate of roughly $O(1/\sqrt{n})$ (where $n$ is the dataset size), without requiring the value class to be closed under the Bellman operator; a schematic reading of this guarantee is given just after this list.
  • Ablation studies confirm that the doubly robust pseudo‑outcome is crucial: using plain importance‑weighted targets leads to higher variance and weaker calibration.
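
One illustrative way to read the theoretical guarantee above, shown schematically here (the precise norm, constants, and regularity conditions are stated in the paper), is that after $K$ calibration passes the estimate $\hat V_K$ nearly matches the conditional mean of its own doubly robust targets:

$$\bigl\| \hat V_K(s) - \mathbb{E}\bigl[\tilde Y \mid \hat V_K(s)\bigr] \bigr\| = O_P\bigl(n^{-1/2}\bigr).$$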

Practical Implications

  • Plug‑and‑play upgrade: Teams can add IBC to existing offline RL pipelines (e.g., CQL, BCQ, Fitted Q‑Iteration) without redesigning model architectures.
  • Safer policy evaluation: More reliable value estimates reduce the risk of deploying policies that look good on paper but perform poorly in the real world—a key concern for finance, robotics, and healthcare.
  • Lower data requirements: Because IBC does not rely on Bellman completeness, it works well even when the dataset is limited or highly biased, extending the reach of offline RL to domains where collecting exhaustive data is infeasible.
  • Interpretability boost: Calibration aligns predicted returns with observed one-step returns, making the value function easier to audit and debug for engineers; a minimal audit sketch follows this list.
  • Potential for online fine‑tuning: Although designed for offline settings, the iterative calibration loop could be adapted to online RL as a periodic “value‑function sanity check,” improving stability in non‑stationary environments.
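
As a concrete example of the auditing point above, one can compare binned averages of predicted values against binned averages of the doubly robust targets, essentially a reliability diagram for the value function. The snippet below is an illustrative sketch built on the same hypothetical helpers as before, not tooling shipped with the paper.

```python
import numpy as np

def value_reliability_table(scores, dr_targets, n_bins=10):
    """Binned comparison of predicted values vs. doubly robust targets
    (illustrative audit helper, not part of the paper)."""
    order = np.argsort(scores)                # sort states by predicted value
    groups = np.array_split(order, n_bins)    # roughly equal-count bins
    rows = []
    for b, idx in enumerate(groups):
        if idx.size:
            # A large gap between the two means flags a miscalibrated region.
            rows.append((b, float(scores[idx].mean()),
                         float(dr_targets[idx].mean()), int(idx.size)))
    return rows  # (bin, mean predicted V, mean DR target, count)
```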

Limitations & Future Work

  • Computational overhead: Each calibration iteration adds a regression pass over the dataset; while cheap for tabular or moderate‑size data, it can become noticeable for massive replay buffers.
  • Choice of calibration method: Histogram binning requires choosing the number and placement of bins, and isotonic regression can be sensitive to noise. Automated selection or adaptive binning remains an open question.
  • Policy improvement coupling: The paper focuses on value calibration; integrating IBC tightly with policy‑optimization steps (e.g., actor‑critic updates) could yield larger performance gains but needs careful stability analysis.
  • Extension to stochastic policies: Current theory assumes a deterministic target policy; extending guarantees to stochastic policies and multi‑step horizons is a promising direction.

Authors

  • Lars van der Laan
  • Nathan Kallus

Paper Information

  • arXiv ID: 2512.23694v1
  • Categories: stat.ML, cs.LG, econ.EM
  • Published: December 29, 2025