[Paper] DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation
Source: arXiv - 2602.19261v1
Overview
The paper introduces Directed Graph Policy Optimization (DGPO), a novel framework that combines reinforcement‑learning (RL) fine‑tuning with discrete graph diffusion to generate directed acyclic graphs (DAGs) representing neural network architectures. By explicitly handling edge directionality—something prior diffusion models for graphs ignored—DGPO can steer the generative process toward high‑performing architectures and even extrapolate beyond the data it was trained on.
Key Contributions
- Direction‑aware diffusion: Extends discrete graph diffusion to DAGs using topological node ordering and positional encodings, preserving data‑flow semantics.
- RL‑steered generation: Applies policy‑gradient RL to fine‑tune the diffusion model toward a reward (e.g., validation accuracy) while keeping the underlying generative distribution intact.
- Transferable structural priors: Shows that a model pretrained on only 7 % of a NAS benchmark’s search space can later generate near‑oracle architectures after RL fine‑tuning.
- Strong empirical results: Matches or exceeds the best known scores on NAS‑Bench‑101 and all three NAS‑Bench‑201 tasks (91.61 % on CIFAR‑10, 73.49 % on CIFAR‑100, and 46.77 % on ImageNet‑16‑120).
- Bidirectional control experiments: Demonstrates genuine reward‑driven steering—optimizing for the opposite objective collapses performance to random‑chance levels.
Methodology
- Pre‑training a discrete graph diffusion model on a large pool of random DAGs from a NAS benchmark. The diffusion process learns to “denoise” a corrupted graph back to a valid architecture.
- Encoding directionality:
  - Topological ordering guarantees that every edge points from a lower‑rank node to a higher‑rank node, enforcing acyclicity.
  - Positional encodings (similar to those used in Transformers) are added to node features so the diffusion network can differentiate upstream from downstream nodes.
- RL fine‑tuning (DGPO):
  - Treat the diffusion model as a stochastic policy that samples a candidate architecture.
  - Compute a reward (e.g., validation accuracy on a proxy dataset).
  - Apply a policy‑gradient update (REINFORCE with a baseline) to increase the likelihood of high‑reward graphs while preserving the diffusion prior.
- Evaluation: Sample thousands of architectures from the fine‑tuned model, evaluate them on the benchmark, and compare against oracle and baseline methods.
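The two structural ideas above can be sketched together in a toy form. The snippet below is an illustrative simplification, not the paper's implementation: it replaces the diffusion model with independent Bernoulli edge logits over the strictly upper triangle of an adjacency matrix (so a fixed topological order makes every sample a DAG by construction) and uses a stand-in reward in place of validation accuracy; the REINFORCE-with-baseline update mirrors the fine-tuning step described above.

```python
# Toy sketch of direction-aware sampling + REINFORCE fine-tuning.
# The policy, reward, and hyperparameters here are illustrative
# assumptions, not DGPO's actual diffusion model.
import numpy as np

rng = np.random.default_rng(0)
N_NODES = 7    # nodes in the candidate DAG
N_STEPS = 200  # policy-gradient iterations (illustrative)
LR = 0.5

# Independent Bernoulli logits over the strictly upper triangle:
# with a fixed topological order, every edge points from a
# lower-index node to a higher-index one, so acyclicity holds
# by construction.
logits = np.zeros((N_NODES, N_NODES))
upper = np.triu_indices(N_NODES, k=1)

def sample_dag(logits):
    """Sample an upper-triangular adjacency matrix (always a DAG)."""
    probs = 1.0 / (1.0 + np.exp(-logits[upper]))
    edges = (rng.random(probs.shape) < probs).astype(float)
    adj = np.zeros_like(logits)
    adj[upper] = edges
    return adj, probs, edges

def reward(adj):
    """Placeholder reward; in DGPO this would be validation accuracy.
    Here: prefer graphs with roughly N_NODES edges (purely illustrative)."""
    return -abs(adj.sum() - N_NODES)

baseline = 0.0
for step in range(N_STEPS):
    adj, probs, edges = sample_dag(logits)
    r = reward(adj)
    # REINFORCE with a running-mean baseline to reduce variance.
    advantage = r - baseline
    baseline = 0.9 * baseline + 0.1 * r
    # Gradient of log-prob for independent Bernoullis: (edge - p).
    grad = np.zeros_like(logits)
    grad[upper] = (edges - probs) * advantage
    logits += LR * grad

final_adj, _, _ = sample_dag(logits)
```

The upper-triangular mask is the whole acyclicity argument in miniature: no projection or rejection step is needed because cycles are unrepresentable under the fixed node order.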
Results & Findings
| Benchmark | Metric (higher is better) | DGPO (full data) | DGPO (7 % pre‑train) | Oracle / Best Known |
|---|---|---|---|---|
| NAS‑Bench‑201 (CIFAR‑10) | Accuracy % | 91.61 | 91.29 (‑0.32) | 91.61 |
| NAS‑Bench‑201 (CIFAR‑100) | Accuracy % | 73.49 | 73.20 (‑0.29) | 73.49 |
| NAS‑Bench‑201 (ImageNet‑16‑120) | Accuracy % | 46.77 | 46.44 (‑0.33) | 46.77 |
- Transferability: With only 7 % of the search space seen during pre‑training, DGPO stays within 0.33 percentage points of full‑data performance on every task, showing that the diffusion model learns reusable architectural motifs.
- Extrapolation: After RL fine‑tuning, DGPO surpasses the performance ceiling of the pre‑trained model by ~7.3 %, indicating that the RL step discovers novel, high‑quality structures not present in the original training set.
- Control experiment: When the reward is inverted (i.e., the model is trained to minimize accuracy), the generated architectures collapse to near‑random performance (~9.5 % accuracy), confirming that improvements stem from reward‑driven steering rather than a biased diffusion prior.
Practical Implications
- Accelerated NAS pipelines: Developers can pre‑train a compact diffusion model on a modest subset of a search space and later fine‑tune it with RL on a specific hardware or latency budget, dramatically cutting the number of expensive full‑training evaluations.
- Domain‑agnostic generative design: The direction‑aware diffusion framework can be repurposed for any combinatorial design problem where edge direction matters (e.g., data‑flow pipelines, compiler optimization graphs, circuit synthesis).
- Plug‑and‑play reward functions: Since DGPO treats the diffusion model as a policy, any differentiable or black‑box metric (energy consumption, FLOPs, latency, robustness) can be swapped in without redesigning the generator.
- Reduced carbon footprint: By needing fewer full‑training runs, organizations can lower the compute cost and associated emissions of large‑scale NAS campaigns.
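To make the "plug-and-play reward" point concrete, here is a hedged sketch of what such an interface could look like. All names (`RewardFn`, `accuracy_reward`, `latency_reward`, the `arch` dict fields) are assumptions for illustration; the summary does not specify DGPO's actual reward API.

```python
# Hypothetical reward interface: any scalar metric can be swapped in
# without touching the generator. Names and fields are illustrative.
from typing import Callable

RewardFn = Callable[[dict], float]  # architecture spec -> scalar reward

def accuracy_reward(arch: dict) -> float:
    # Stand-in for "evaluate on a proxy dataset".
    return arch.get("val_accuracy", 0.0)

def latency_reward(arch: dict, budget_ms: float = 10.0) -> float:
    # Black-box hardware metric: penalize exceeding a latency budget.
    return -max(0.0, arch.get("latency_ms", 0.0) - budget_ms)

def combined_reward(arch: dict, alpha: float = 0.1) -> float:
    # Any scalarization of multiple objectives also fits the interface.
    return accuracy_reward(arch) + alpha * latency_reward(arch)

arch = {"val_accuracy": 0.91, "latency_ms": 14.0}
print(combined_reward(arch))  # ~0.51: accuracy term minus 0.1 * 4 ms overage
```

Because the fine-tuning loop only ever sees a scalar, the same generator can be re-steered toward FLOPs, energy, or robustness targets by swapping this one function.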
Limitations & Future Work
- Scalability to larger search spaces: Experiments are limited to NAS‑Bench‑101/201 (≤ 10⁶ architectures). Extending DGPO to industrial‑scale NAS (billions of candidates) may require hierarchical diffusion or memory‑efficient encodings.
- Reward latency: RL fine‑tuning still depends on evaluating sampled architectures, which can be a bottleneck for expensive training regimes; surrogate predictors or weight‑sharing could mitigate this.
- Generalization beyond DAGs: While the method handles DAGs well, many real‑world graphs contain cycles (e.g., recurrent networks). Adapting the topological ordering trick to cyclic graphs remains an open challenge.
- Theoretical guarantees: The paper provides empirical evidence of reward steering but lacks formal convergence or optimality proofs for the combined diffusion‑RL system.
DGPO bridges the gap between powerful generative diffusion models and the precise control needed for neural architecture search, offering a practical toolset for developers who want to harness AI‑driven design without drowning in compute‑heavy searches.
Authors
- Aleksei Liuliakov
- Luca Hermes
- Barbara Hammer
Paper Information
- arXiv ID: 2602.19261v1
- Categories: cs.LG, cs.AI, cs.NE
- Published: February 22, 2026