[Paper] Unified Policy Value Decomposition for Rapid Adaptation
Source: arXiv - 2603.17947v1
Overview
The paper proposes a new reinforcement‑learning architecture that lets agents adapt instantly to new tasks by sharing a compact “goal embedding” between the policy and value networks. By learning a set of reusable basis functions during pre‑training, the agent can handle novel objectives (e.g., new movement directions) with a single forward pass—no extra gradient updates required.
Key Contributions
- Bilinear Actor‑Critic Decomposition – Factorizes the Q‑function as a sum of value bases multiplied by goal‑dependent coefficients, and mirrors this structure in the policy network.
- Shared Low‑Dimensional Goal Embedding – A single coefficient vector G(g) captures task identity for both actor and critic, enabling zero‑shot adaptation.
- Biologically Inspired Gain Modulation – The multiplicative gating resembles how top‑down signals modulate pyramidal neuron responses, offering a plausible neural analogy.
- Zero‑Shot Transfer on MuJoCo Ant – Demonstrates immediate adaptation to unseen locomotion directions by interpolating in the learned goal space.
- Extension of Successor Features – Generalizes the successor‑feature idea from value‑based RL to the policy side, creating “primitive policies” that are recombined on the fly.
Methodology
- Pre‑training Phase
  - Train a Soft Actor‑Critic (SAC) agent on a multi‑goal version of the Ant environment, where each task is defined by a continuous goal vector g (e.g., a direction to walk).
  - Learn value bases y_k(s, a) and policy bases π_k(a | s) that are task‑agnostic; they capture the generic dynamics of the robot.
- Bilinear Factorization
  - Critic: Q(s, a, g) = Σ_k G_k(g) · y_k(s, a)
  - Actor: π(a | s, g) = Σ_k G_k(g) · π_k(a | s)
  - G(g) ∈ ℝ^K is a low‑dimensional embedding produced by a small goal‑encoder network.
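The bilinear critic above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the linear value bases, and the tanh goal encoder are all assumptions chosen for brevity.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper).
K, state_dim, action_dim, goal_dim = 4, 8, 2, 2
rng = np.random.default_rng(0)

# Task-agnostic value bases y_k(s, a); linear maps stand in for networks here.
W_y = rng.normal(size=(K, state_dim + action_dim))

def value_bases(s, a):
    """y_k(s, a) for all k: one scalar per basis, shape (K,)."""
    return W_y @ np.concatenate([s, a])

# Small goal encoder producing the shared embedding G(g) in R^K.
W_g = rng.normal(size=(K, goal_dim))

def goal_embedding(g):
    return np.tanh(W_g @ g)

def q_value(s, a, g):
    """Bilinear critic: Q(s, a, g) = sum_k G_k(g) * y_k(s, a)."""
    return goal_embedding(g) @ value_bases(s, a)

s = rng.normal(size=state_dim)
a = rng.normal(size=action_dim)
g = np.array([1.0, 0.0])  # e.g. "walk along +x"
print(q_value(s, a, g))
```

The actor mirrors the same structure, with policy bases π_k(a | s) in place of the value bases.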
- Zero‑Shot Adaptation
  - Freeze all bases after pre‑training.
  - For a new goal g′, compute G(g′) with a single forward pass and combine the frozen bases to obtain the new policy and value function instantly.
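The zero‑shot step can be sketched as recombining frozen primitive policies. In this illustrative NumPy sketch the bases are reduced to fixed mean actions and the embedding is softmax‑normalized; both are simplifying assumptions, not the paper's exact setup.

```python
import numpy as np

K = 4
rng = np.random.default_rng(1)

# Frozen primitive policies, reduced to mean actions of K bases pi_k(a | s).
# Assumption: each basis prefers one fixed heading.
basis_means = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

# Frozen goal encoder weights (stand-in for the trained goal-encoder network).
W_g = rng.normal(size=(K, 2))

def goal_embedding(g):
    """G(g) as softmax-normalized mixture weights over the K bases (assumption)."""
    logits = W_g @ g
    e = np.exp(logits - logits.max())
    return e / e.sum()

def act(g):
    """Zero-shot policy for a new goal g': one forward pass, no gradient steps."""
    G = goal_embedding(g)      # G(g') in R^K, sums to 1
    return G @ basis_means     # mixture-mean action

new_goal = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])  # unseen 45° heading
print(act(new_goal))
```

The key point is that nothing is optimized at test time: adaptation is one matrix‑vector product through the goal encoder plus a weighted sum of frozen bases.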
- Evaluation
- Test on unseen directions (interpolated and extrapolated beyond the eight training headings).
- Compare against standard SAC (re‑trained per direction) and a multi‑head baseline without shared embeddings.
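The unseen‑direction test set can be generated schematically. Assuming the eight training headings are evenly spaced around the circle (an illustrative assumption), interpolated test goals are the midpoints between adjacent training angles:

```python
import numpy as np

# Assumption: the eight training headings are evenly spaced around the circle.
train_angles = np.arange(8) * (2 * np.pi / 8)

# Interpolated (unseen) headings: midpoints between adjacent training angles.
interp = (train_angles[:-1] + train_angles[1:]) / 2

# Unit goal vectors g' fed to the goal encoder at test time.
goals = np.stack([np.cos(interp), np.sin(interp)], axis=1)
print(goals.shape)  # (7, 2)
```

Extrapolated goals would fall outside the convex hull of the training headings, which is where the baselines in the table below fail.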
Results & Findings
| Metric | Standard SAC (re‑trained) | Multi‑head (no sharing) | Bilinear Shared‑Embedding |
|---|---|---|---|
| Success rate on trained directions | 96 % | 94 % | 97 % |
| Success rate on unseen directions | 0 % (needs retraining) | 12 % | 85 % |
| Adaptation latency (ms) | – (gradient steps) | 5 ms | 3 ms (single forward) |
| Parameter overhead vs. vanilla SAC | +12 % | +25 % | +15 % |
- The shared goal embedding interpolates smoothly between known directions, producing sensible locomotion even for angles never seen during training.
- Visualizations of the coefficient space reveal a structured manifold in which nearby goals have similar G(g) values, confirming that the embedding captures task similarity.
- Ablation studies show that freezing only the policy bases (or only the value bases) degrades performance, highlighting the importance of joint actor‑critic factorization.
Practical Implications
- Rapid Prototyping of Controllers – Engineers can pre‑train a single model on a family of tasks (e.g., different robot gait patterns) and then deploy it to new objectives without costly on‑device learning.
- Edge‑Device RL – The zero‑shot adaptation requires only a lightweight forward pass, making it suitable for low‑power robots, drones, or IoT actuators that cannot afford iterative gradient updates.
- Modular Policy Libraries – The primitive policy bases act like reusable “skills” that can be recombined on demand, simplifying the construction of hierarchical or compositional agents.
- Transfer Across Sim‑to‑Real Gaps – By learning a goal embedding that abstracts away environment specifics, the same architecture could be fine‑tuned for real‑world hardware with minimal data.
- Neuro‑Inspired Design – The gain‑modulation mechanism offers a concrete blueprint for building RL systems that mimic cortical processing, potentially improving robustness and interpretability.
Limitations & Future Work
- Scalability of Basis Count – The number of bases K must be chosen manually; too few limit expressivity, too many increase memory and inference cost.
- Goal Representation Simplicity – Experiments used low‑dimensional continuous vectors; extending to high‑dimensional or symbolic goals (e.g., language commands) remains open.
- Generalization Beyond Interpolation – While interpolation works well, extrapolation to drastically different dynamics (e.g., new robot morphologies) was not evaluated.
- Biological Plausibility vs. Engineering Trade‑offs – The gain‑modulation analogy is intriguing but not rigorously tested against neurophysiological data.
Future research could explore automatic basis discovery, hierarchical embeddings for multi‑modal goals, and real‑world deployments on physical robots to validate zero‑shot adaptation under noisy sensors and actuation.
Authors
- Cristiano Capone
- Luca Falorsi
- Andrea Ciardiello
- Luca Manneschi
Paper Information
- arXiv ID: 2603.17947v1
- Categories: cs.LG, q-bio.NC
- Published: March 18, 2026