[Paper] Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
Source: arXiv - 2512.15687v1
Overview
A new reinforcement‑learning (RL) framework called G2RL (Gradient‑Guided Reinforcement Learning) lets large language models (LLMs) steer their own exploration using the gradients they would generate during training. By rewarding sampled trajectories whose gradients would push the model’s parameters in novel directions, G2RL produces more diverse and effective reasoning behavior than traditional entropy bonuses or external similarity metrics. The authors demonstrate consistent gains on a suite of math and reasoning benchmarks using 1.7B- and 4B-parameter Qwen3 models.
Key Contributions
- Self‑referential exploration signal – Uses the model’s own first‑order update geometry (gradient features) to decide which sampled responses are worth exploring.
- Bounded multiplicative reward scaler – Trajectories that introduce orthogonal or opposing gradient directions receive a boost, while redundant ones are down‑weighted.
- Compatibility with PPO/KL‑control – The gradient‑based reward integrates cleanly with standard PPO stability mechanisms, avoiding the instability often seen with external heuristics.
- Empirical validation across diverse reasoning tasks – Shows improvements on MATH500, AMC, AIME24/25, GPQA, and MMLU-Pro, measured by pass@1, maj@16, and pass@k.
- Geometric analysis of exploration – Demonstrates that G2RL expands the policy’s update space into more orthogonal directions without sacrificing semantic coherence.
Methodology
- Forward‑Pass Feature Extraction – For each candidate response, the model’s final hidden layer is inspected to compute a sensitivity vector (the Jacobian of the output logits w.r.t. the final hidden activations). This adds essentially no cost beyond the normal forward pass (see the first sketch after this list).
- Gradient‑Based Similarity – Within a batch of sampled trajectories, the pairwise cosine similarity of these sensitivity vectors is calculated. Low similarity means the trajectories would push the model’s parameters in different directions.
- Reward Scaling – A bounded multiplicative factor (e.g., 1 ± α·(1 − similarity)) is applied to the usual RL reward (e.g., the correctness score). High‑novelty trajectories receive a larger factor, low‑novelty ones a smaller factor (see the second sketch after this list).
- PPO Update – The scaled rewards are fed into a standard Proximal Policy Optimization loop with a KL penalty, ensuring stable learning (see the third sketch after this list).
- Iterative Sampling – The process repeats, continually reshaping the policy toward regions of the parameter space that have not yet been explored.
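To make the pipeline concrete, the three sketches below are minimal PyTorch illustrations, not the authors’ released code. This first one covers the feature‑extraction step; it approximates the sensitivity vector by the gradient of the sequence log‑likelihood with respect to the final hidden activations (a cheap stand‑in for the full logits‑activations Jacobian), and names such as `sensitivity_vector`, `lm_head`, and `hidden_states` are hypothetical.

```python
import torch
import torch.nn.functional as F

def sensitivity_vector(hidden_states: torch.Tensor,
                       lm_head: torch.nn.Module,
                       token_ids: torch.Tensor) -> torch.Tensor:
    """hidden_states: (seq_len, d_model) final-layer activations for one
    sampled trajectory; lm_head: the output projection to vocabulary logits;
    token_ids: (seq_len,) tokens actually sampled."""
    hidden = hidden_states.detach().requires_grad_(True)
    logits = lm_head(hidden)                       # (seq_len, vocab)
    logp = F.log_softmax(logits, dim=-1)
    # Log-likelihood of the sampled tokens under the current policy.
    ll = logp.gather(-1, token_ids.unsqueeze(-1)).sum()
    # d(log-likelihood) / d(final hidden activations): only the LM head is
    # re-traversed, so the extra cost on top of the forward pass is small.
    (grad,) = torch.autograd.grad(ll, hidden)
    v = grad.mean(dim=0)                           # pool over tokens -> (d_model,)
    return v / (v.norm() + 1e-8)                   # unit norm for cosine similarity
```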
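The second sketch covers the similarity and reward‑scaling steps. Beyond the rough form 1 ± α·(1 − similarity), the exact scaler is not spelled out here, so the centering and clipping below (and the name `novelty_scaled_rewards`) are assumptions.

```python
import torch

def novelty_scaled_rewards(sensitivities: torch.Tensor,
                           rewards: torch.Tensor,
                           alpha: float = 0.2) -> torch.Tensor:
    """sensitivities: (n, d) unit sensitivity vectors for a batch of sampled
    trajectories; rewards: (n,) task rewards (e.g., correctness scores)."""
    n = sensitivities.shape[0]
    sim = sensitivities @ sensitivities.T          # (n, n) pairwise cosine similarities
    # Mean similarity of each trajectory to the rest of the batch
    # (the self-similarity on the diagonal is excluded).
    mean_sim = (sim.sum(dim=1) - sim.diagonal()) / max(n - 1, 1)
    # One reading of the bounded factor "1 ± α·(1 − similarity)": opposing
    # gradients (similarity < 0) are boosted above 1, orthogonal ones stay
    # near 1, and redundant ones (similarity near 1) shrink toward 1 − α.
    scale = (1.0 - alpha * mean_sim).clamp(1.0 - alpha, 1.0 + alpha)
    return rewards * scale
```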
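The third sketch shows one way the scaled rewards could enter a PPO‑style clipped objective with a KL penalty toward a reference policy. The group‑normalized advantage and the simple sampled KL estimate are assumptions rather than the paper’s exact loss, and `g2rl_ppo_loss` and its arguments are hypothetical names.

```python
import torch

def g2rl_ppo_loss(new_logp: torch.Tensor, old_logp: torch.Tensor,
                  ref_logp: torch.Tensor, scaled_rewards: torch.Tensor,
                  clip_eps: float = 0.2, kl_coef: float = 0.05) -> torch.Tensor:
    """All inputs have shape (n,): per-trajectory sequence log-probabilities
    under the current, behavior, and reference policies, plus the
    novelty-scaled rewards from the previous sketch."""
    # Group-normalized advantages computed from the scaled rewards.
    adv = (scaled_rewards - scaled_rewards.mean()) / (scaled_rewards.std() + 1e-8)
    ratio = (new_logp - old_logp).exp()            # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    # Sampled estimate of KL(current || reference) to keep updates stable.
    kl_penalty = (new_logp - ref_logp).mean()
    return policy_loss + kl_coef * kl_penalty
```

In a full training loop these pieces would run once per iteration: sample a batch of responses, extract their sensitivity vectors, rescale the task rewards by gradient novelty, and take a clipped policy‑gradient step before sampling again.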
Results & Findings
| Benchmark | Baseline (GRPO + entropy bonus) | G2RL (1.7B) | G2RL (4B) |
|---|---|---|---|
| MATH500 (pass@1) | 22.3 % | 27.9 % | 34.5 % |
| AMC (maj@16) | 41.8 % | 48.2 % | 55.6 % |
| AIME24 (pass@k) | 18.7 % | 24.3 % | 30.1 % |
| GPQA (pass@1) | 35.4 % | 41.0 % | 46.8 % |
| MMLU-Pro (pass@1) | 62.1 % | 68.9 % | 74.3 % |
- Orthogonal Gradient Expansion: The average cosine similarity between sampled trajectories’ sensitivity vectors dropped from ~0.68 (entropy baseline) to ~0.31 (G2RL), indicating more diverse update directions.
- Semantic Coherence Preserved: Human evaluation showed no increase in nonsensical outputs; the model still respects the prompt context.
- Training Overhead: The gradient‑feature computation adds < 2 % runtime overhead per PPO iteration.
Practical Implications
- Better Reasoning Agents: Developers building LLM‑powered tutoring systems, code assistants, or scientific assistants can achieve higher correctness with fewer fine‑tuning steps.
- Reduced Need for Hand‑Crafted Exploration Bonuses: Teams can drop hand‑tuned entropy bonuses and rely on the model’s own update geometry, simplifying the RL pipeline.
- Scalable to Larger Models: Since the feature extraction is cheap, the approach scales to multi‑billion‑parameter models without prohibitive compute costs.
- More Efficient Data Usage: By encouraging truly novel updates, G2RL can extract more learning signal from the same amount of annotated or self‑generated data, lowering annotation budgets.
- Potential for Continual Learning: The gradient‑guided signal could be repurposed for on‑device adaptation where stability (KL control) is critical.
Limitations & Future Work
- Gradient Approximation Quality: The method relies on first‑order sensitivities; higher‑order effects (e.g., curvature) are ignored and could further refine exploration.
- Batch Size Sensitivity: The novelty reward depends on the diversity within a sampled batch; very small batches may yield noisy scaling.
- Domain Transfer: Experiments focus on math and general reasoning; it remains to be seen how G2RL performs on dialogue, retrieval‑augmented generation, or multimodal tasks.
- Theoretical Guarantees: While empirical orthogonality improves, formal convergence or optimality guarantees under gradient‑guided rewards are still open questions.
Overall, G2RL offers a compelling, low‑overhead way for LLM developers to let the model’s own learning dynamics drive smarter exploration, paving the way for more capable and data‑efficient reasoning systems.
Authors
- Zhenwen Liang
- Sidi Lu
- Wenhao Yu
- Kishan Panaganti
- Yujun Zhou
- Haitao Mi
- Dong Yu
Paper Information
- arXiv ID: 2512.15687v1
- Categories: cs.LG, cs.AI
- Published: December 17, 2025