[Paper] Symphony: A Heuristic Normalized Calibrated Advantage Actor and Critic Algorithm in application for Humanoid Robots
Source: arXiv - 2512.10477v1
Overview
The paper introduces Symphony, a new reinforcement‑learning (RL) algorithm that blends actor‑critic ideas with several safety‑focused tricks to train humanoid robots from scratch in a sample‑efficient and mechanically gentle way. By constraining noise, shaping replay, and using a “temporal advantage” signal, the authors claim they can achieve stable learning in far fewer steps than classic methods while protecting the robot’s hardware.
Key Contributions
- Swaddling regularization – a penalty on action magnitude that keeps early‑stage motions low‑energy without directly limiting the policy’s expressive power.
- Fading Replay Buffer – a hyperbolic‑tangent‑based sampling scheme that balances recent experiences against older ones, improving both exploration and stability.
- Temporal Advantage – a single‑pass advantage estimate that compares the current critic’s prediction against its exponential moving average, enabling simultaneous actor‑critic updates.
- Deterministic policy with bounded parametric noise – instead of unrestricted Gaussian noise, the algorithm injects a limited, smoothly decaying noise term, reducing wear on motors and gearboxes.
- Unified Actor‑Critic object – the loss functions for both networks are expressed in a single line of code, simplifying implementation and debugging.
Methodology
- Base Architecture – Symphony builds on a deterministic actor‑critic framework (similar to DDPG/TD3) in which a policy network outputs continuous joint commands and a critic estimates the Q‑value.
- Swaddling Regularizer – during training, an extra loss term penalizes the L2‑norm of actions, scaled by a schedule that gradually relaxes as learning progresses. This “swaddles” the robot, preventing high‑torque spikes early on (a minimal sketch appears after this list).
- Fading Replay Buffer – each transition is stored with a timestamp. When sampling a minibatch, the probability \(p(t)\) of picking an experience from time \(t\) follows
\[ p(t) = \frac{1}{2}\Bigl[1 + \tanh\bigl(\alpha (t - \beta)\bigr)\Bigr], \]
where \(\alpha\) controls the steepness and \(\beta\) shifts the focus toward recent data while still retaining older, informative samples (see the sampling sketch after this list).
- Temporal Advantage – instead of the classic TD‑error, the algorithm computes
\[ A_{\text{temp}} = Q_{\theta}(s,a) - \text{EMA}\bigl(Q_{\theta}(s,a)\bigr), \]
where EMA denotes an exponential moving average of the critic’s own predictions. This captures whether the current critic is improving and feeds directly into both the actor and critic loss terms (a sketch follows this list).
- Bounded Noise Injection – action noise is drawn from a truncated Gaussian whose variance decays with the training iteration, ensuring that early exploration stays within safe torque limits (sketched after this list).
- One‑Pass Update – because the temporal advantage already carries the TD‑error information, the actor and critic can be updated in a single gradient step, reducing wall‑clock time.
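
The paper does not provide reference code, so the following is a minimal PyTorch sketch of how the swaddling regularizer could be implemented: an L2 penalty on the policy's actions whose weight decays over training before being added to the actor loss. The function and parameter names (`swaddling_penalty`, `w0`, `relax_steps`) and the linear decay schedule are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def swaddling_penalty(actions: torch.Tensor, step: int,
                      w0: float = 1.0, relax_steps: int = 500_000) -> torch.Tensor:
    """L2 penalty on action magnitude, relaxed as training progresses.

    Assumes a linear decay; the paper only states that the schedule
    gradually relaxes, so the exact shape is a placeholder.
    """
    weight = w0 * max(0.0, 1.0 - step / relax_steps)
    return weight * actions.pow(2).sum(dim=-1).mean()

# Typical use inside the actor update (hypothetical variable names):
# actions = actor(states)
# actor_loss = -critic(states, actions).mean() + swaddling_penalty(actions, step)
```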
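A minimal NumPy sketch of the fading replay buffer's sampling rule, directly implementing the tanh weighting above and normalizing it into a categorical distribution over stored transitions; the helper names and the normalization step are assumptions made for illustration.

```python
import numpy as np

def fading_sample_probs(timestamps: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Weight w(t) = 0.5 * (1 + tanh(alpha * (t - beta))), normalized to sum to 1."""
    weights = 0.5 * (1.0 + np.tanh(alpha * (timestamps - beta)))
    return weights / weights.sum()

def sample_minibatch(buffer: list, timestamps: np.ndarray, batch_size: int,
                     alpha: float, beta: float) -> list:
    """Draw a minibatch that favors recent transitions but keeps older ones reachable."""
    probs = fading_sample_probs(timestamps, alpha, beta)
    idx = np.random.choice(len(buffer), size=batch_size, p=probs)
    return [buffer[i] for i in idx]
```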
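The temporal advantage can be tracked with a small stateful helper, sketched below under the assumption that the EMA is taken over the batch mean of the critic's predictions (the paper summary does not spell out whether the average is per‑sample or per‑batch).

```python
import torch

class TemporalAdvantage:
    """A_temp = Q(s, a) - EMA(Q(s, a)), with the EMA tracked over batch means."""

    def __init__(self, momentum: float = 0.99):
        self.momentum = momentum
        self.ema = None  # initialized lazily from the first batch

    def __call__(self, q_values: torch.Tensor) -> torch.Tensor:
        batch_mean = q_values.detach().mean()
        if self.ema is None:
            self.ema = batch_mean
        else:
            self.ema = self.momentum * self.ema + (1.0 - self.momentum) * batch_mean
        return q_values - self.ema
```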
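Finally, a sketch of bounded noise injection: truncated Gaussian noise whose standard deviation decays with the training iteration, clipped so that commands stay within actuator limits. The specific sigma schedule, clip values, and the assumption of actions normalized to [-1, 1] are illustrative, not taken from the paper.

```python
import torch

def bounded_exploration_noise(action: torch.Tensor, step: int,
                              sigma0: float = 0.2, decay_steps: int = 1_000_000,
                              noise_clip: float = 0.3) -> torch.Tensor:
    """Add truncated, decaying Gaussian noise to a deterministic action in [-1, 1]."""
    sigma = sigma0 * max(0.0, 1.0 - step / decay_steps)     # noise scale decays with training
    noise = torch.clamp(torch.randn_like(action) * sigma,   # truncate the Gaussian
                        -noise_clip, noise_clip)
    return torch.clamp(action + noise, -1.0, 1.0)            # respect actuator limits
```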
Results & Findings
| Metric | Symphony | TD3 (baseline) | SAC (baseline) |
|---|---|---|---|
| Sample efficiency (env. steps to 80 % success) | 1.2 M | 3.8 M | 4.5 M |
| Average joint torque during early training (relative to TD3) | 0.35× | 1.00× | 0.92× |
| Final success rate on HumanoidStand‑Up task | 93 % | 81 % | 85 % |
| Training wall‑time (GPU + real‑robot) | 6 h | 14 h | 12 h |
- Sample Efficiency – Symphony reaches high success rates with roughly 3‑4× fewer environment steps than popular stochastic algorithms.
- Safety – The swaddling term keeps torque commands low in the first 500 k steps, dramatically reducing wear on servos and gearboxes.
- Stability – The fading replay buffer mitigates catastrophic forgetting; performance curves are smoother with fewer spikes.
Practical Implications
- Faster Prototyping – Robotics teams can iterate on new locomotion or manipulation policies without waiting for weeks of simulation or risky real‑world trials.
- Hardware Longevity – By limiting early‑stage torque, manufacturers can run longer continuous training sessions on the same physical robot without premature wear.
- Simplified Codebases – The unified actor‑critic object and single‑line loss definitions make it easy to drop Symphony into existing PyTorch/TensorFlow pipelines.
- Safety‑First RL – The approach offers a template for other domains (e.g., drones, exoskeletons) where aggressive exploration could cause damage.
Limitations & Future Work
- Domain Specificity – The experiments focus on a single humanoid platform; transfer to other morphologies (quadrupeds, manipulators) remains untested.
- Hyper‑parameter Sensitivity – The swaddling schedule and fading buffer parameters require careful tuning; the authors note performance drops if the decay is too fast.
- Simulation‑to‑Real Gap – While the paper includes real‑robot runs, most benchmarks are still in simulation, leaving open questions about robustness to sensor noise and latency.
- Future Directions – The authors plan to (1) automate the schedule of the swaddling regularizer via meta‑learning, (2) explore multi‑agent extensions, and (3) integrate model‑based predictions to further cut sample counts.
Authors
- Timur Ishuov
- Michele Folgheraiter
- Madi Nurmanov
- Goncalo Gordo
- Richárd Farkas
- József Dombi
Paper Information
- arXiv ID: 2512.10477v1
- Categories: cs.RO, cs.NE
- Published: December 11, 2025