[Paper] Unified Policy Value Decomposition for Rapid Adaptation
Source: arXiv - 2603.17947v1
Overview
The paper proposes a new reinforcement‑learning architecture that lets agents adapt instantly to new tasks by sharing a compact “goal embedding” between the policy and value networks. By learning a set of reusable basis functions during pre‑training, the agent can handle novel objectives (e.g., new movement directions) with a single forward pass—no extra gradient updates required.
Key Contributions
- Bilinear Actor‑Critic Decomposition – Factorizes the Q‑function as a sum of value bases multiplied by goal‑dependent coefficients, and mirrors this structure in the policy network.
- Shared Low‑Dimensional Goal Embedding – A single coefficient vector G(g) captures task identity for both actor and critic, enabling zero‑shot adaptation.
- Biologically Inspired Gain Modulation – The multiplicative gating resembles how top‑down signals modulate pyramidal neuron responses, offering a plausible neural analogy.
- Zero‑Shot Transfer on MuJoCo Ant – Demonstrates immediate adaptation to unseen locomotion directions by interpolating in the learned goal space.
- Extension of Successor Features – Generalizes the successor‑feature idea from value‑based RL to the policy side, creating “primitive policies” that are recombined on the fly.
Methodology
- Pre‑training Phase
  - Train a Soft Actor‑Critic (SAC) agent on a multi‑goal version of the Ant environment, where each task is defined by a continuous goal vector g (e.g., a direction to walk).
  - Learn value bases y_k(s, a) and policy bases π_k(a | s) that are task‑agnostic; they capture the generic dynamics of the robot.
- Bilinear Factorization
  - Critic: Q(s, a, g) = Σ_k G_k(g) · y_k(s, a)
  - Actor: π(a | s, g) = Σ_k G_k(g) · π_k(a | s)
  - G(g) ∈ ℝ^K is a low‑dimensional embedding produced by a small goal‑encoder network.
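The bilinear critic above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the linear value bases, and the tanh goal encoder are all assumptions chosen for brevity.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper).
K, state_dim, action_dim, goal_dim = 4, 8, 2, 2
rng = np.random.default_rng(0)

# Task-agnostic value bases y_k(s, a); linear maps stand in for networks here.
W_y = rng.normal(size=(K, state_dim + action_dim))

def value_bases(s, a):
    """y_k(s, a) for all k: one scalar per basis, shape (K,)."""
    return W_y @ np.concatenate([s, a])

# Small goal encoder producing the shared embedding G(g) in R^K.
W_g = rng.normal(size=(K, goal_dim))

def goal_embedding(g):
    return np.tanh(W_g @ g)

def q_value(s, a, g):
    """Bilinear critic: Q(s, a, g) = sum_k G_k(g) * y_k(s, a)."""
    return goal_embedding(g) @ value_bases(s, a)

s = rng.normal(size=state_dim)
a = rng.normal(size=action_dim)
g = np.array([1.0, 0.0])  # e.g. "walk along +x"
print(q_value(s, a, g))
```

The actor mirrors the same structure, with policy bases π_k(a | s) in place of the value bases.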
- Zero‑Shot Adaptation
  - Freeze all bases after pre‑training.
  - For a new goal g′, compute G(g′) with a single forward pass and combine the frozen bases to obtain the new policy and value function instantly.
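The zero‑shot step can be sketched as recombining frozen primitive policies. In this illustrative NumPy sketch the bases are reduced to fixed mean actions and the embedding is softmax‑normalized; both are simplifying assumptions, not the paper's exact setup.

```python
import numpy as np

K = 4
rng = np.random.default_rng(1)

# Frozen primitive policies, reduced to mean actions of K bases pi_k(a | s).
# Assumption: each basis prefers one fixed heading.
basis_means = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

# Frozen goal encoder weights (stand-in for the trained goal-encoder network).
W_g = rng.normal(size=(K, 2))

def goal_embedding(g):
    """G(g) as softmax-normalized mixture weights over the K bases (assumption)."""
    logits = W_g @ g
    e = np.exp(logits - logits.max())
    return e / e.sum()

def act(g):
    """Zero-shot policy for a new goal g': one forward pass, no gradient steps."""
    G = goal_embedding(g)      # G(g') in R^K, sums to 1
    return G @ basis_means     # mixture-mean action

new_goal = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])  # unseen 45° heading
print(act(new_goal))
```

The key point is that nothing is optimized at test time: adaptation is one matrix‑vector product through the goal encoder plus a weighted sum of frozen bases.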
- Evaluation
- Test on unseen directions (interpolated and extrapolated beyond the eight training headings).
- Compare against standard SAC (re‑trained per direction) and a multi‑head baseline without shared embeddings.
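The unseen‑direction test set can be generated schematically. Assuming the eight training headings are evenly spaced around the circle (an illustrative assumption), interpolated test goals are the midpoints between adjacent training angles:

```python
import numpy as np

# Assumption: the eight training headings are evenly spaced around the circle.
train_angles = np.arange(8) * (2 * np.pi / 8)

# Interpolated (unseen) headings: midpoints between adjacent training angles.
interp = (train_angles[:-1] + train_angles[1:]) / 2

# Unit goal vectors g' fed to the goal encoder at test time.
goals = np.stack([np.cos(interp), np.sin(interp)], axis=1)
print(goals.shape)  # (7, 2)
```

Extrapolated goals would fall outside the convex hull of the training headings, which is where the baselines in the table below fail.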
Results & Findings
| Metric | Standard SAC (re‑trained) | Multi‑head (no sharing) | Bilinear Shared‑Embedding |
|---|---|---|---|
| Success rate on trained directions | 96 % | 94 % | 97 % |
| Success rate on unseen directions | 0 % (needs retraining) | 12 % | 85 % |
| Adaptation latency (ms) | – (gradient steps) | 5 ms | 3 ms (single forward) |
| Parameter overhead vs. vanilla SAC | +12 % | +25 % | +15 % |
- The shared goal embedding interpolates smoothly between known directions, producing sensible locomotion even for angles never seen during training.
- Visualizations of the coefficient space reveal a structured manifold in which nearby goals have similar G(g) values, confirming that the embedding captures task similarity.
- Ablation studies show that freezing only the policy bases (or only the value bases) degrades performance, highlighting the importance of joint actor‑critic factorization.
Practical Implications
- Rapid Prototyping of Controllers – Engineers can pre‑train a single model on a family of tasks (e.g., different robot gait patterns) and then deploy it to new objectives without costly on‑device learning.
- Edge‑Device RL – The zero‑shot adaptation requires only a lightweight forward pass, making it suitable for low‑power robots, drones, or IoT actuators that cannot afford iterative gradient updates.
- Modular Policy Libraries – The primitive policy bases act like reusable “skills” that can be recombined on demand, simplifying the construction of hierarchical or compositional agents.
- Transfer Across Sim‑to‑Real Gaps – By learning a goal embedding that abstracts away environment specifics, the same architecture could be fine‑tuned for real‑world hardware with minimal data.
- Neuro‑Inspired Design – The gain‑modulation mechanism offers a concrete blueprint for building RL systems that mimic cortical processing, potentially improving robustness and interpretability.
Limitations & Future Work
- Scalability of Basis Count – The number of bases K must be chosen manually; too few limit expressivity, too many increase memory and inference cost.
- Goal Representation Simplicity – Experiments used low‑dimensional continuous vectors; extending to high‑dimensional or symbolic goals (e.g., language commands) remains open.
- Generalization Beyond Interpolation – While interpolation works well, extrapolation to drastically different dynamics (e.g., new robot morphologies) was not evaluated.
- Biological Plausibility vs. Engineering Trade‑offs – The gain‑modulation analogy is intriguing but not rigorously tested against neurophysiological data.
Future research could explore automatic basis discovery, hierarchical embeddings for multi‑modal goals, and real‑world deployments on physical robots to validate zero‑shot adaptation under noisy sensors and actuation.
Authors
- Cristiano Capone
- Luca Falorsi
- Andrea Ciardiello
- Luca Manneschi
Paper Information
- arXiv ID: 2603.17947v1
- Categories: cs.LG, q-bio.NC
- Published: March 18, 2026