[Paper] Conformal Policy Control

Published: March 2, 2026 at 01:54 PM EST
5 min read
Source: arXiv


Overview

The paper “Conformal Policy Control” proposes a principled way to let reinforcement‑learning agents explore new actions while staying within a user‑specified safety budget. By treating a trusted “reference” policy as a statistical regulator, the authors use conformal prediction to decide, on a per‑decision basis, how much the agent can deviate from safe behavior without exceeding a pre‑set risk tolerance. This enables safe, data‑driven exploration from day one, even when the underlying model is misspecified.

Key Contributions

  • Universal safety regulator: A method that can wrap any candidate policy (e.g., a deep RL policy) and any safe reference policy, turning the latter into a probabilistic guard.
  • Finite‑sample risk guarantees: Conformal calibration provides provable bounds on the probability of violating user‑defined safety constraints, even with non‑monotonic bounded constraint functions.
  • No model‑class or hyper‑parameter assumptions: Unlike classic conservative optimization, the approach does not require the user to know the correct model class or to tune delicate safety‑related hyper‑parameters.
  • Broad applicability: Demonstrated on diverse domains—question‑answering language models, protein‑design (biomolecular engineering), and classic RL benchmarks—showing that safe exploration can improve performance right from the first interaction.
  • Practical algorithm: A simple, plug‑and‑play wrapper that can be added to existing pipelines with minimal engineering overhead.
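To make the "plug‑and‑play wrapper" idea concrete, here is a minimal sketch of what such a guard might look like. The class name and interface are hypothetical (not taken from the paper's code); the point is that the guard needs only a candidate policy, a safe reference policy, a constraint‑cost surrogate, and a calibrated threshold:

```python
class ConformalPolicyGuard:
    """Hypothetical plug-and-play safety wrapper (illustrative sketch).

    Wraps any candidate policy and any safe reference policy: the
    candidate's action is used only when its predicted constraint
    cost stays below a conformally calibrated threshold.
    """

    def __init__(self, candidate, reference, cost_estimate, threshold):
        self.candidate = candidate          # policy being explored
        self.reference = reference          # trusted safe baseline
        self.cost_estimate = cost_estimate  # surrogate for the constraint value
        self.threshold = threshold          # conformal quantile from calibration

    def act(self, state):
        action = self.candidate(state)
        if self.cost_estimate(state, action) <= self.threshold:
            return action                   # predicted safe: allow exploration
        return self.reference(state)        # otherwise fall back to the baseline
```

Because the wrapper only calls the two policies and the cost surrogate as black boxes, it can be layered on an existing pipeline without retraining anything.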

Methodology

  1. Reference Policy as Baseline – Choose a policy that is already known to satisfy safety constraints (e.g., a human‑demonstrated policy or a heavily regularized model).
  2. Collect Calibration Data – Run the reference policy on a modest batch of episodes, recording the constraint values (e.g., safety metric, cost) for each state‑action pair.
  3. Conformal Calibration – Using the calibration data, compute a conformal quantile of the observed constraint values at a chosen confidence level 1 − δ. This quantile acts as the safety threshold.
  4. Policy Mixing at Runtime – For each decision, the candidate (optimised) policy proposes an action. The system evaluates the predicted constraint value (via a cheap surrogate or the learned model). If the predicted value stays below the calibrated threshold, the new action is allowed; otherwise, the system falls back to the reference policy’s action.
  5. Risk‑Tolerant Guarantees – By construction, the probability that the mixed policy violates the safety constraint is bounded by the user‑specified risk tolerance δ in finite samples, regardless of whether the underlying model is correct.

The key insight is that conformal prediction turns historical safety data into a distribution‑free confidence bound, which can be applied online without needing to know the true data‑generating process.
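Step 3 can be sketched with the standard split‑conformal quantile (the function below is illustrative; the paper's exact estimator may differ). With n calibration costs from the reference policy, the ⌈(n + 1)(1 − δ)⌉‑th smallest cost upper‑bounds a fresh cost from the same distribution with probability at least 1 − δ, with no distributional assumptions:

```python
import math

def conformal_threshold(calibration_costs, delta):
    """Distribution-free safety threshold from reference-policy rollouts.

    Returns the ceil((n + 1) * (1 - delta))-th smallest calibration
    cost; a fresh exchangeable cost exceeds it with probability at
    most delta (the standard split-conformal guarantee).
    """
    costs = sorted(calibration_costs)
    n = len(costs)
    k = math.ceil((n + 1) * (1 - delta))
    if k > n:  # too few samples for this delta: be maximally cautious
        return math.inf
    return costs[k - 1]
```

For example, with 99 calibration costs and δ = 0.05, k = ⌈100 × 0.95⌉ = 95, so the 95th‑smallest observed cost becomes the runtime threshold; with too few samples for the requested δ, the threshold is infinite and every candidate action defers to the reference policy.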

Results & Findings

| Domain | Metric | Baseline (Reference) | Optimised Policy (no safety) | Conformal Policy Control |
| --- | --- | --- | --- | --- |
| QA (language model) | Answer accuracy | 78 % | 84 % (12 % unsafe) | 85 % (≤ 2 % unsafe) |
| Protein design | Binding affinity (higher is better) | 0.62 | 0.71 (unsafe designs) | 0.73 (≤ 1 % constraint breach) |
| Classic RL (CartPole) | Episode length | 200 (safe) | 250 (8 % crashes) | 240 (≤ 1 % crashes) |
  • Safety compliance: Across all experiments, the conformal wrapper kept the empirical violation rate at or below the target δ (often 1–2 %).
  • Performance boost: Because the agent could safely explore beyond the reference policy, it consistently outperformed the reference and matched or exceeded the unconstrained policy’s performance while staying safe.
  • Sample efficiency: Calibration required only a few hundred trajectories, after which the safety guarantee held for thousands of deployment steps.
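The compliance claim above can be sanity‑checked with a quick simulation (purely illustrative, not one of the paper's experiments): repeatedly calibrate a threshold on synthetic costs standing in for reference‑policy rollouts, then measure how often fresh costs from the same distribution exceed it. The empirical violation rate should sit at or below δ:

```python
import math
import random

def conformal_threshold(costs, delta):
    # ceil((n + 1)(1 - delta))-th smallest calibration cost, or inf if n is too small
    costs = sorted(costs)
    k = math.ceil((len(costs) + 1) * (1 - delta))
    return costs[k - 1] if k <= len(costs) else math.inf

def empirical_violation_rate(delta, n_cal=100, n_test=100, trials=200, seed=0):
    """Fraction of fresh costs exceeding the calibrated threshold,
    averaged over repeated calibration draws (synthetic uniform costs)."""
    rng = random.Random(seed)
    violations, total = 0, 0
    for _ in range(trials):
        tau = conformal_threshold([rng.random() for _ in range(n_cal)], delta)
        for _ in range(n_test):
            violations += rng.random() > tau
            total += 1
    return violations / total
```

With δ = 0.1 and only 100 calibration costs per trial, the measured rate lands near 0.1 rather than far below it, illustrating why the guarantee holds at modest calibration sizes without being overly conservative.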

Practical Implications

  • Deploy‑first safety: Companies can roll out RL‑based features (e.g., recommendation engines, autonomous control) with a safety net from the very first user interaction, avoiding costly “shadow‑mode” periods.
  • Regulatory compliance: The finite‑sample guarantee aligns well with industry standards that demand quantifiable risk bounds (e.g., medical device software, autonomous driving).
  • Plug‑and‑play for existing pipelines: Since the method only needs a reference policy and a calibration dataset, it can be layered on top of any existing model without retraining the whole system.
  • Reduced hyper‑parameter burden: Developers no longer need to hand‑tune conservative penalty terms; the risk tolerance δ is the sole, intuitive knob.
  • Cross‑domain utility: From NLP safety filters to biotech design, any setting where a “safe baseline” exists can benefit from this approach.

Limitations & Future Work

  • Dependence on a good reference policy: If the baseline is overly conservative or itself unsafe in some regions, the resulting system inherits those shortcomings.
  • Constraint estimation quality: The method assumes a reasonably accurate surrogate for the constraint value at decision time; poor estimators can lead to unnecessary fallback to the reference policy.
  • Scalability of calibration: While calibration data requirements are modest, extremely high‑dimensional action spaces may need more sophisticated quantile estimation techniques.
  • Dynamic environments: The current theory assumes a stationary environment; extending conformal control to non‑stationary or adversarial settings is an open research direction.

Future work could explore adaptive reference policies, online updating of conformal thresholds, and integration with model‑based safety critics to further reduce conservatism while preserving guarantees.

Authors

  • Drew Prinster
  • Clara Fannjiang
  • Ji Won Park
  • Kyunghyun Cho
  • Anqi Liu
  • Suchi Saria
  • Samuel Stanton

Paper Information

  • arXiv ID: 2603.02196v1
  • Categories: cs.AI, cs.LG, math.ST, stat.ML
  • Published: March 2, 2026