[Paper] GraphBench: Next-generation graph learning benchmarking

Published: December 4, 2025 at 12:30 AM EST
4 min read
Source: arXiv (2512.04475v1)

Overview

GraphBench is a new, open‑source benchmarking suite that brings order to the chaotic landscape of graph‑machine‑learning (GML) evaluation. By unifying datasets, splits, metrics, and hyper‑parameter tuning across node‑, edge‑, graph‑level, and generative tasks, it gives developers a single, reproducible playground for testing and comparing GML models—from classic message‑passing networks to the latest graph transformers.

Key Contributions

  • Unified benchmark collection covering multiple domains (e.g., chemistry, social networks, chip design) and four fundamental task families (node, edge, graph, generative).
  • Standardized evaluation protocol with fixed train/validation/test splits, out‑of‑distribution (OOD) test sets, and a common set of performance metrics (accuracy, ROC‑AUC, MAE, etc.); a minimal metric sketch follows this list.
  • Integrated hyper‑parameter tuning framework that runs a fair, automated search for each model‑dataset pair, eliminating “hand‑tuned” bias.
  • Reference baselines for both message‑passing neural networks (MPNNs) and graph transformer architectures, complete with reproducible training scripts and logs.
  • Extensible design that lets the community add new datasets, tasks, or model families while preserving the core evaluation guarantees.
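
The exact metric implementations behind this shared protocol are not shown in this summary. The snippet below is only a minimal sketch of the three metrics named above, computed with scikit‑learn on placeholder arrays standing in for labels and predictions from a fixed test split; it is not GraphBench's own evaluation code.

```python
# Minimal sketch of the shared metrics named above (accuracy, ROC-AUC, MAE),
# computed with scikit-learn. The arrays are illustrative placeholders,
# not outputs of GraphBench's data loaders.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_absolute_error

rng = np.random.default_rng(0)

# Node/graph classification: true labels vs. predicted labels.
y_true_cls = rng.integers(0, 2, size=100)
y_pred_cls = rng.integers(0, 2, size=100)
print("accuracy:", accuracy_score(y_true_cls, y_pred_cls))

# Link prediction / binary tasks: true labels vs. predicted scores.
y_score = rng.random(100)
print("ROC-AUC:", roc_auc_score(y_true_cls, y_score))

# Regression tasks (e.g., molecular property prediction).
y_true_reg = rng.normal(size=100)
y_pred_reg = y_true_reg + rng.normal(scale=0.1, size=100)
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
```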

Methodology

  1. Dataset Curation – The authors gathered 30+ publicly available graph datasets spanning chemistry (e.g., OGB‑MolPCBA), social media (e.g., Reddit), recommendation (e.g., MovieLens), and hardware design (e.g., circuit netlists). Each dataset is pre‑processed into a canonical format (edge list + node/edge features).
  2. Task Definition – For every dataset, the appropriate prediction task is defined (node classification, link prediction, graph classification, or graph generation). The suite automatically generates OOD splits by time‑based or structural perturbations to test generalization (a minimal time‑based split sketch appears after this list).
  3. Evaluation Protocol – All experiments use the same random seeds, early‑stopping criteria, and evaluation metrics. Results are reported as mean ± std over 5 runs.
  4. Hyper‑parameter Search – A lightweight Bayesian optimizer (Tree‑structured Parzen Estimator) runs a fixed budget (e.g., 50 trials) per model‑dataset pair, searching over learning rate, hidden dimension, dropout, and number of layers. The best configuration is then evaluated on the test split (see the TPE sketch after this list).
  5. Baseline Models – Two families are implemented: (a) classic MPNNs (GCN, GAT, GraphSAGE) and (b) graph transformers (GT, SAN). Both are trained with the same optimizer (AdamW) and loss functions appropriate to the task (a baseline training sketch follows this list).
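
How the OOD splits in step 2 are generated is not detailed in this summary. The sketch below shows one plausible time‑based split, assuming every sample carries a timestamp and the newest fraction is held out as the OOD test set; the function name and fractions are illustrative assumptions.

```python
# Minimal sketch of a time-based OOD split: train/validate on older samples,
# hold out the most recent fraction as the OOD test set. The timestamp array
# and split fractions are illustrative assumptions, not GraphBench internals.
import numpy as np

def time_based_split(timestamps, val_fraction=0.1, ood_fraction=0.2, seed=0):
    order = np.argsort(timestamps)          # oldest -> newest
    n = len(order)
    n_ood = int(n * ood_fraction)
    n_val = int(n * val_fraction)

    ood_test = order[n - n_ood:]            # newest samples capture drift
    in_dist = order[: n - n_ood]

    rng = np.random.default_rng(seed)
    in_dist = rng.permutation(in_dist)      # random split within the in-distribution pool
    val, train = in_dist[:n_val], in_dist[n_val:]
    return train, val, ood_test

timestamps = np.arange(1000)                # e.g., interaction or publication times
train_idx, val_idx, ood_idx = time_based_split(timestamps)
print(len(train_idx), len(val_idx), len(ood_idx))
```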
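
Step 4 names a Tree‑structured Parzen Estimator with a fixed trial budget but does not name a library. The sketch below uses Optuna's TPE sampler as one plausible realization; the search space mirrors the knobs listed above, and `train_and_validate` is a placeholder rather than an actual training run.

```python
# Sketch of a fixed-budget TPE search over learning rate, hidden dimension,
# dropout, and number of layers, using Optuna as one plausible implementation;
# the paper does not specify a library in this summary.
import optuna

def train_and_validate(lr, hidden_dim, dropout, num_layers):
    # Placeholder: train on the train split and return validation performance
    # (higher is better). Not a real model run.
    return -((lr - 1e-3) ** 2) - 0.01 * abs(num_layers - 4)

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    hidden_dim = trial.suggest_categorical("hidden_dim", [64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    num_layers = trial.suggest_int("num_layers", 2, 8)
    return train_and_validate(lr, hidden_dim, dropout, num_layers)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
)
study.optimize(objective, n_trials=50)      # fixed budget, as in the protocol
print("best config:", study.best_params)    # this config is then run on the test split
```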
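
The baseline training scripts themselves are not reproduced in this summary. Below is a minimal sketch of one MPNN baseline from step 5, a two‑layer GCN for node classification trained with AdamW; the use of PyTorch Geometric and the Cora dataset are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of one MPNN baseline: a two-layer GCN for node classification
# trained with AdamW. PyTorch Geometric and Cora are assumed here for
# illustration; the actual GraphBench training scripts are not shown above.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, dropout=0.5):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)
        self.dropout = dropout

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return self.conv2(x, edge_index)

dataset = Planetoid(root="data/Cora", name="Cora")   # small citation graph for illustration
data = dataset[0]
model = GCN(dataset.num_features, 128, dataset.num_classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)

# Under the protocol in step 3, this loop would be repeated over 5 fixed seeds
# and reported as mean +/- std; a single run is shown here.
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```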

Results & Findings

  • Performance Gap – Graph transformers consistently outperform MPNNs on tasks with long‑range dependencies (e.g., molecular property prediction on OGB‑MolPCBA) but offer marginal gains on highly local tasks (e.g., citation node classification).
  • OOD Robustness – Models tuned on the unified protocol show a 10‑15 % drop in accuracy when evaluated on OOD splits, highlighting the importance of evaluating generalization beyond random splits.
  • Hyper‑parameter Sensitivity – The automated search reveals that learning rate and depth are the most critical knobs across all tasks, while dropout matters mainly for generative models.
  • Reproducibility – All baseline numbers are reproducible with a single command (graphbench run <model> <dataset>), and the reported variance is low (≤ 0.02 MAE for most regression tasks).

Practical Implications

  • Faster Model Development – Developers can plug their own GNN implementation into GraphBench and obtain a fair comparison against state‑of‑the‑art baselines without building custom data pipelines.
  • Better Generalization Checks – The built‑in OOD splits encourage teams to test whether a model will hold up when the graph structure drifts (e.g., new chip designs or emerging social networks).
  • Benchmark‑Driven Hiring & Procurement – Companies can use the standardized scores to benchmark vendor‑supplied GNN solutions, making procurement decisions more data‑driven.
  • Accelerated Research‑to‑Product Cycle – By exposing a single source of truth for performance, GraphBench reduces the “benchmark‑gaming” overhead that often stalls production deployments of GNNs.

Limitations & Future Work

  • Dataset Coverage – While diverse, the current suite still lacks large‑scale dynamic graphs (e.g., streaming social feeds) and multimodal graph data (e.g., vision‑language graphs).
  • Compute Budget – The default hyper‑parameter budget (≈ 50 trials) may be insufficient for very deep transformer variants, potentially under‑estimating their true performance.
  • Generative Evaluation – Metrics for graph generation (e.g., Fréchet Graph Distance) are still evolving; the authors note that more robust, task‑specific measures are needed.
  • Future Directions – The authors plan to add reinforcement‑learning‑based graph construction tasks, expand OOD split strategies, and integrate hardware‑accelerated training pipelines (e.g., GPU‑TensorRT, IPU).

Ready to try it out? Visit the live demo at www.graphbench.io and start benchmarking your next graph‑learning model today.

Authors

  • Timo Stoll
  • Chendi Qian
  • Ben Finkelshtein
  • Ali Parviz
  • Darius Weber
  • Fabrizio Frasca
  • Hadar Shavit
  • Antoine Siraudin
  • Arman Mielke
  • Marie Anastacio
  • Erik Müller
  • Maya Bechler‑Speicher
  • Michael Bronstein
  • Mikhail Galkin
  • Holger Hoos
  • Mathias Niepert
  • Bryan Perozzi
  • Jan Tönshoff
  • Christopher Morris

Paper Information

  • arXiv ID: 2512.04475v1
  • Categories: cs.LG, cs.AI, cs.NE, stat.ML
  • Published: December 4, 2025