[Paper] Arc Gradient Descent: A Mathematically Derived Reformulation of Gradient Descent with Phase-Aware, User-Controlled Step Dynamics

Published: December 7, 2025 at 04:03 AM EST

Source: arXiv - 2512.06737v1

Overview

The paper introduces Arc Gradient Descent (ArcGD), a mathematically derived reformulation of classic gradient descent that treats each update as a movement along an “arc” rather than a straight line. By making the step size phase‑aware and giving users direct control over the dynamics of each update, ArcGD aims to tame the erratic behavior of existing optimizers on highly non‑convex landscapes while preserving fast convergence.

Key Contributions

  • Arc‑based reformulation of gradient descent – Derives update rules from first‑principles geometry, interpreting each iteration as a rotation on a hyperspherical surface (restated compactly after this list).
  • Phase‑aware step dynamics – Introduces a user‑controllable “phase factor” that modulates the curvature of the update arc, allowing fine‑grained tuning of exploration vs. exploitation.
  • Comprehensive empirical evaluation
    • Benchmarked on stochastic Rosenbrock functions up to 50 000 dimensions, showing consistent superiority over Adam when both use ArcGD’s effective learning rate.
    • Tested on CIFAR‑10 with eight heterogeneous MLP architectures, achieving the highest average test accuracy (50.7 %) after 20 k iterations.
  • Connection to existing optimizers – Demonstrates that a special case of ArcGD reduces to the Lion optimizer, providing a theoretical bridge between the two families.
  • Open‑source implementation – Provides a lightweight PyTorch‑compatible optimizer that can be dropped into existing training pipelines with a single line change.
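
For quick reference, the two update rules summarized in the Methodology section can be written side by side. This is a compact restatement of the formulas as they appear in this summary, not the paper's full derivation; the exact rotation parameterization may differ.

```latex
% Classic gradient descent: a straight-line step against the gradient.
\theta_{t+1} \;=\; \theta_t \;-\; \eta \,\nabla L(\theta_t)

% ArcGD as summarized here: the iterate moves along an arc (a rotation R
% by phase angle \phi_t), giving an effective step size
% \eta_{\mathrm{eff}} that shrinks as the phase angle grows.
\theta_{t+1} \;=\; R(\phi_t)\,\theta_t,
\qquad
\eta_{\mathrm{eff}} \;=\; \eta\,\frac{\sin\phi_t}{\phi_t}
```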

Methodology

  1. Geometric Derivation

    • Starts from the standard gradient descent update θ_{t+1} = θ_t - η ∇L(θ_t).
    • Re‑expresses the update as a rotation on a unit hypersphere: θ_{t+1} = R(φ_t) θ_t, where R is a rotation matrix parameterized by a phase angle φ_t.
    • The phase angle is computed from the gradient magnitude and a user‑defined phase schedule (e.g., linear, cosine, or adaptive).
  2. Effective Learning Rate

    • The “effective” step size becomes η_eff = η * sin(φ_t) / φ_t, which automatically scales down large updates in steep regions while leaving updates in flatter regions nearly unchanged (sin(φ_t)/φ_t → 1 as φ_t → 0).
  3. Implementation Details

    • Integrated as a drop‑in replacement for torch.optim.Optimizer.
    • Supports per‑parameter groups, weight decay, and optional momentum (implemented as a secondary rotation); a minimal sketch of such a drop‑in optimizer appears after this list.
  4. Experimental Protocol

    • Synthetic benchmark: Stochastic Rosenbrock function with dimensions {2, 10, 100, 1 000, 50 000}. Two learning‑rate settings were used to isolate the effect of the ArcGD dynamics (a sketch of this benchmark also appears after this list).
    • Real‑world benchmark: CIFAR‑10 classification using eight MLP variants (1–5 hidden layers, varying widths). All optimizers were run for 20 k iterations; intermediate checkpoints at 5 k and 10 k iterations were recorded.
  5. Evaluation Metrics (simple helpers for these metrics are sketched after this list)

    • Final loss / final test accuracy.
    • Convergence speed (iterations to reach 90 % of the final loss).
    • Generalization gap (difference between training and test accuracy).
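
To make items 2 and 3 concrete, here is a minimal sketch of what a drop‑in optimizer following this recipe could look like. The class name ArcGD is reused from this summary for readability, but the constructor arguments (phase_scale, weight_decay), the way the phase angle is derived from the gradient norm, and the clamping are illustrative assumptions; the authors' released implementation and phase schedules may differ, and momentum as a "secondary rotation" is omitted.

```python
import math

import torch
from torch.optim import Optimizer


class ArcGD(Optimizer):
    """Illustrative ArcGD-style optimizer sketch (not the authors' released code).

    The step direction is the negative gradient, but its length is rescaled by
    sin(phi)/phi, where the phase angle phi is derived here from the gradient
    norm and a user-controlled phase scale (an assumed schedule chosen to
    mirror the description above).
    """

    def __init__(self, params, lr=1e-3, phase_scale=1.0, weight_decay=0.0):
        defaults = dict(lr=lr, phase_scale=phase_scale, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, phase_scale, wd = group["lr"], group["phase_scale"], group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if wd != 0.0:
                    grad = grad.add(p, alpha=wd)
                # Phase angle from the gradient magnitude (assumption): steep regions
                # give a larger phi, hence a smaller sin(phi)/phi and a damped step.
                # The clamp keeps the scaling factor positive and monotone.
                phi = phase_scale * grad.norm().clamp(max=math.pi / 2).item()
                scale = math.sin(phi) / phi if phi > 1e-8 else 1.0  # sin(x)/x -> 1 as x -> 0
                p.add_(grad, alpha=-lr * scale)
        return loss
```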
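The synthetic benchmark in item 4 can be reproduced in spirit as follows. The additive‑noise model, the starting point, and the learning rate are assumptions; the summary does not specify how stochasticity is injected or which exact settings were used.

```python
import torch


def stochastic_rosenbrock(x: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """Rosenbrock function in d dimensions with additive Gaussian noise.

    f(x) = sum_i [ 100 * (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 ]
    The noise term makes each evaluation stochastic (assumed noise model).
    """
    clean = (100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2).sum()
    return clean + noise_std * torch.randn(())


# Example: a 1 000-dimensional instance driven by the ArcGD sketch above.
d = 1_000
x = torch.full((d,), -1.0, requires_grad=True)
opt = ArcGD([x], lr=1e-4, phase_scale=1.0)
for step in range(10_000):
    opt.zero_grad()
    loss = stochastic_rosenbrock(x)
    loss.backward()
    opt.step()
```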
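The metrics in item 5 are straightforward to compute from logged training curves. The helper below adopts one possible reading of "iterations to reach 90 % of the final loss" (the iteration at which 90 % of the total loss reduction has been achieved); the paper's precise definition may differ.

```python
from typing import List


def convergence_speed(losses: List[float], fraction: float = 0.9) -> int:
    """Iterations until the loss has achieved `fraction` of its total reduction
    (one possible reading of "90 % of the final loss")."""
    target = losses[0] - fraction * (losses[0] - losses[-1])
    for t, loss in enumerate(losses):
        if loss <= target:
            return t
    return len(losses) - 1


def generalization_gap(train_acc: float, test_acc: float) -> float:
    """Difference between training and test accuracy."""
    return train_acc - test_acc
```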

Results & Findings

Setting        | Optimizer | Final Test Accuracy (avg.) | Early‑stage (5 k iters) | Over‑fit Resistance
CIFAR‑10 MLPs  | ArcGD     | 50.7 %                     | 44.2 %                  | Improves steadily
CIFAR‑10 MLPs  | AdamW     | 46.6 %                     | 48.9 %                  | Peaks early, then degrades
CIFAR‑10 MLPs  | Adam      | 46.8 %                     | 49.1 %                  | Same pattern as AdamW
CIFAR‑10 MLPs  | SGD       | 49.6 %                     | 42.5 %                  | Slower early, catches up
CIFAR‑10 MLPs  | Lion      | 43.4 %                     | 40.3 %                  | Consistently lower
  • Synthetic Rosenbrock: With ArcGD’s effective learning rate, the optimizer reached lower minima across all dimensions, even in the 50 000‑D case where Adam diverged. When both used Adam’s default learning rate, ArcGD was slower initially but still produced superior final solutions in 4/5 dimensionalities.
  • Generalization: ArcGD’s test accuracy kept rising past 10 k iterations, whereas Adam/AdamW plateaued and even regressed, indicating better resistance to over‑fitting without extra regularization or early‑stop tuning.
  • Phase‑schedule impact: A cosine‑decay phase schedule yielded the best trade‑off between exploration (early iterations) and fine‑grained convergence (later iterations).
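
The summary does not give the exact functional form of the cosine‑decay phase schedule, so the following is an assumed version: the phase scale decays from an exploratory value toward a small one, which with the ArcGD sketch in the Methodology section shifts updates from strongly curved arcs early on to nearly straight, fine‑grained steps later.

```python
import math


def cosine_phase_scale(step: int, total_steps: int,
                       start: float = 1.0, end: float = 0.1) -> float:
    """Cosine decay of the phase scale from `start` (exploration) to `end`
    (fine-grained convergence). Functional form is an assumption."""
    progress = min(step / max(total_steps, 1), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


# Per-step usage with the ArcGD sketch defined earlier:
# for group in opt.param_groups:
#     group["phase_scale"] = cosine_phase_scale(step, total_steps=20_000)
```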

Practical Implications

  • Plug‑and‑play optimizer for deep‑learning pipelines – Developers can replace Adam with a single line (optimizer = ArcGD(model.parameters(), lr=0.001)) and immediately benefit from more stable long‑run training, especially on tasks prone to over‑fitting; a minimal usage sketch follows this list.
  • Robustness on high‑dimensional, ill‑conditioned problems – The arc formulation naturally damps oscillations in narrow valleys, making it attractive for training large language models, reinforcement‑learning policies, or scientific‑computing models where curvature can be extreme.
  • Fine‑grained control without hyper‑parameter explosion – The phase schedule replaces the need for separate learning‑rate warm‑up, decay, or cyclical policies; developers can tune a single “phase‑scale” parameter to achieve similar effects.
  • Potential for better generalization – By continuing to improve after the early convergence window, ArcGD reduces reliance on early‑stop heuristics, simplifying hyper‑parameter sweeps for production training jobs.
  • Compatibility with existing tooling – Since ArcGD is built on the PyTorch Optimizer API, it works with mixed‑precision training, distributed data‑parallel, and gradient‑clipping utilities out of the box.
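
As a concrete illustration of the "single line change", here is a minimal training loop built around the ArcGD sketch from the Methodology section. The model, batch shapes, and stand‑in data are placeholders; the point is that the optimizer line is the only change relative to an Adam‑based loop, and standard utilities such as gradient clipping continue to work.

```python
import torch
import torch.nn as nn

# A small MLP standing in for the paper's CIFAR-10 models (architecture assumed).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 512),
                      nn.ReLU(), nn.Linear(512, 10))
criterion = nn.CrossEntropyLoss()
optimizer = ArcGD(model.parameters(), lr=0.001)  # the only line that changes vs. Adam

# Random CIFAR-10-shaped batches stand in for a real DataLoader here.
train_loader = [(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
                for _ in range(10)]

for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    # Existing utilities such as gradient clipping keep working unchanged.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```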

Limitations & Future Work

  • Computational overhead – The rotation‑based update adds a modest (~5‑10 %) per‑step cost compared to Adam, which may be noticeable in ultra‑large models.
  • Hyper‑parameter sensitivity – While the phase schedule consolidates several learning‑rate tricks, choosing an appropriate schedule (linear vs. cosine vs. adaptive) still requires empirical testing.
  • Benchmarks limited to MLPs and a synthetic Rosenbrock – The paper does not evaluate ArcGD on convolutional networks, transformers, or reinforcement‑learning agents, leaving open questions about scalability to those domains.
  • Theoretical convergence guarantees – The authors provide a geometric derivation but defer rigorous proofs of convergence rates in stochastic settings to future work.
  • Future directions – Extending ArcGD to second‑order information (e.g., curvature‑aware phases), integrating with adaptive momentum schemes, and exploring automatic phase‑schedule learning via meta‑optimization.

Authors

  • Nikhil Verma
  • Joonas Linnosmaa
  • Espinosa‑Leal Leonardo
  • Napat Vajragupta

Paper Information

  • arXiv ID: 2512.06737v1
  • Categories: cs.LG, cs.AI, cs.CL, cs.CV, cs.NE
  • Published: December 7, 2025