[Paper] GraphBench: Next-generation graph learning benchmarking

Published: December 4, 2025 at 12:30 AM EST
4 min read
Source: arXiv (2512.04475v1)

Overview

GraphBench is a new, open‑source benchmarking suite that brings order to the chaotic landscape of graph‑machine‑learning (GML) evaluation. By unifying datasets, splits, metrics, and hyper‑parameter tuning across node‑, edge‑, graph‑level, and generative tasks, it gives developers a single, reproducible playground for testing and comparing GML models—from classic message‑passing networks to the latest graph transformers.

Key Contributions

  • Unified benchmark collection covering multiple domains (e.g., chemistry, social networks, chip design) and four fundamental task families (node, edge, graph, generative).
  • Standardized evaluation protocol with fixed train/validation/test splits, out‑of‑distribution (OOD) test sets, and a common set of performance metrics (accuracy, ROC‑AUC, MAE, etc.); a minimal metric sketch follows this list.
  • Integrated hyper‑parameter tuning framework that runs a fair, automated search for each model‑dataset pair, eliminating “hand‑tuned” bias.
  • Reference baselines for both message‑passing neural networks (MPNNs) and graph transformer architectures, complete with reproducible training scripts and logs.
  • Extensible design that lets the community add new datasets, tasks, or model families while preserving the core evaluation guarantees.
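
The exact metric implementations behind this shared protocol are not shown in this summary. The snippet below is only a minimal sketch of the three metrics named above, computed with scikit‑learn on placeholder arrays standing in for labels and predictions from a fixed test split; it is not GraphBench's own evaluation code.

```python
# Minimal sketch of the shared metrics named above (accuracy, ROC-AUC, MAE),
# computed with scikit-learn. The arrays are illustrative placeholders,
# not outputs of GraphBench's data loaders.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_absolute_error

rng = np.random.default_rng(0)

# Node/graph classification: true labels vs. predicted labels.
y_true_cls = rng.integers(0, 2, size=100)
y_pred_cls = rng.integers(0, 2, size=100)
print("accuracy:", accuracy_score(y_true_cls, y_pred_cls))

# Link prediction / binary tasks: true labels vs. predicted scores.
y_score = rng.random(100)
print("ROC-AUC:", roc_auc_score(y_true_cls, y_score))

# Regression tasks (e.g., molecular property prediction).
y_true_reg = rng.normal(size=100)
y_pred_reg = y_true_reg + rng.normal(scale=0.1, size=100)
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
```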

Methodology

  1. Dataset Curation – The authors gathered 30+ publicly available graph datasets spanning chemistry (e.g., OGB‑MolPCBA), social media (e.g., Reddit), recommendation (e.g., MovieLens), and hardware design (e.g., circuit netlists). Each dataset is pre‑processed into a canonical format (edge list + node/edge features).
  2. Task Definition – For every dataset, the appropriate prediction task is defined (node classification, link prediction, graph classification, or graph generation). The suite automatically generates OOD splits by time‑based or structural perturbations to test generalization (a minimal time‑based split sketch appears after this list).
  3. Evaluation Protocol – All experiments use the same random seeds, early‑stopping criteria, and evaluation metrics. Results are reported as mean ± std over 5 runs.
  4. Hyper‑parameter Search – A lightweight Bayesian optimizer (Tree‑structured Parzen Estimator) runs a fixed budget (e.g., 50 trials) per model‑dataset pair, searching over learning rate, hidden dimension, dropout, and number of layers. The best configuration is then evaluated on the test split (see the TPE sketch after this list).
  5. Baseline Models – Two families are implemented: (a) classic MPNNs (GCN, GAT, GraphSAGE) and (b) graph transformers (GT, SAN). Both are trained with the same optimizer (AdamW) and loss functions appropriate to the task (a baseline training sketch follows this list).
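
How the OOD splits in step 2 are generated is not detailed in this summary. The sketch below shows one plausible time‑based split, assuming every sample carries a timestamp and the newest fraction is held out as the OOD test set; the function name and fractions are illustrative assumptions.

```python
# Minimal sketch of a time-based OOD split: train/validate on older samples,
# hold out the most recent fraction as the OOD test set. The timestamp array
# and split fractions are illustrative assumptions, not GraphBench internals.
import numpy as np

def time_based_split(timestamps, val_fraction=0.1, ood_fraction=0.2, seed=0):
    order = np.argsort(timestamps)          # oldest -> newest
    n = len(order)
    n_ood = int(n * ood_fraction)
    n_val = int(n * val_fraction)

    ood_test = order[n - n_ood:]            # newest samples capture drift
    in_dist = order[: n - n_ood]

    rng = np.random.default_rng(seed)
    in_dist = rng.permutation(in_dist)      # random split within the in-distribution pool
    val, train = in_dist[:n_val], in_dist[n_val:]
    return train, val, ood_test

timestamps = np.arange(1000)                # e.g., interaction or publication times
train_idx, val_idx, ood_idx = time_based_split(timestamps)
print(len(train_idx), len(val_idx), len(ood_idx))
```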
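
Step 4 names a Tree‑structured Parzen Estimator with a fixed trial budget but does not name a library. The sketch below uses Optuna's TPE sampler as one plausible realization; the search space mirrors the knobs listed above, and `train_and_validate` is a placeholder rather than an actual training run.

```python
# Sketch of a fixed-budget TPE search over learning rate, hidden dimension,
# dropout, and number of layers, using Optuna as one plausible implementation;
# the paper does not specify a library in this summary.
import optuna

def train_and_validate(lr, hidden_dim, dropout, num_layers):
    # Placeholder: train on the train split and return validation performance
    # (higher is better). Not a real model run.
    return -((lr - 1e-3) ** 2) - 0.01 * abs(num_layers - 4)

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    hidden_dim = trial.suggest_categorical("hidden_dim", [64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    num_layers = trial.suggest_int("num_layers", 2, 8)
    return train_and_validate(lr, hidden_dim, dropout, num_layers)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
)
study.optimize(objective, n_trials=50)      # fixed budget, as in the protocol
print("best config:", study.best_params)    # this config is then run on the test split
```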
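
The baseline training scripts themselves are not reproduced in this summary. Below is a minimal sketch of one MPNN baseline from step 5, a two‑layer GCN for node classification trained with AdamW; the use of PyTorch Geometric and the Cora dataset are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of one MPNN baseline: a two-layer GCN for node classification
# trained with AdamW. PyTorch Geometric and Cora are assumed here for
# illustration; the actual GraphBench training scripts are not shown above.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, dropout=0.5):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)
        self.dropout = dropout

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return self.conv2(x, edge_index)

dataset = Planetoid(root="data/Cora", name="Cora")   # small citation graph for illustration
data = dataset[0]
model = GCN(dataset.num_features, 128, dataset.num_classes)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)

# Under the protocol in step 3, this loop would be repeated over 5 fixed seeds
# and reported as mean +/- std; a single run is shown here.
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```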

Results & Findings

  • Performance Gap – Graph transformers consistently outperform MPNNs on tasks with long‑range dependencies (e.g., molecular property prediction on OGB‑MolPCBA) but offer marginal gains on highly local tasks (e.g., citation node classification).
  • OOD Robustness – Models tuned on the unified protocol show a 10‑15 % drop in accuracy when evaluated on OOD splits, highlighting the importance of evaluating generalization beyond random splits.
  • Hyper‑parameter Sensitivity – The automated search reveals that learning rate and depth are the most critical knobs across all tasks, while dropout matters mainly for generative models.
  • Reproducibility – All baseline numbers are reproducible with a single command (graphbench run <model> <dataset>), and the reported variance is low (≤ 0.02 MAE for most regression tasks).

Practical Implications

  • Faster Model Development – Developers can plug their own GNN implementation into GraphBench and obtain a fair comparison against state‑of‑the‑art baselines without building custom data pipelines.
  • Better Generalization Checks – The built‑in OOD splits encourage teams to test whether a model will hold up when the graph structure drifts (e.g., new chip designs or emerging social networks).
  • Benchmark‑Driven Hiring & Procurement – Companies can use the standardized scores to benchmark vendor‑supplied GNN solutions, making procurement decisions more data‑driven.
  • Accelerated Research‑to‑Product Cycle – By exposing a single source of truth for performance, GraphBench reduces the “benchmark‑gaming” overhead that often stalls production deployments of GNNs.

Limitations & Future Work

  • Dataset Coverage – While diverse, the current suite still lacks large‑scale dynamic graphs (e.g., streaming social feeds) and multimodal graph data (e.g., vision‑language graphs).
  • Compute Budget – The default hyper‑parameter budget (≈ 50 trials) may be insufficient for very deep transformer variants, potentially under‑estimating their true performance.
  • Generative Evaluation – Metrics for graph generation (e.g., Fréchet Graph Distance) are still evolving; the authors note that more robust, task‑specific measures are needed.
  • Future Directions – The authors plan to add reinforcement‑learning‑based graph construction tasks, expand OOD split strategies, and integrate hardware‑accelerated training pipelines (e.g., GPU‑TensorRT, IPU).

Ready to try it out? Visit the live demo at www.graphbench.io and start benchmarking your next graph‑learning model today.

Authors

  • Timo Stoll
  • Chendi Qian
  • Ben Finkelshtein
  • Ali Parviz
  • Darius Weber
  • Fabrizio Frasca
  • Hadar Shavit
  • Antoine Siraudin
  • Arman Mielke
  • Marie Anastacio
  • Erik Müller
  • Maya Bechler‑Speicher
  • Michael Bronstein
  • Mikhail Galkin
  • Holger Hoos
  • Mathias Niepert
  • Bryan Perozzi
  • Jan Tönshoff
  • Christopher Morris

Paper Information

  • arXiv ID: 2512.04475v1
  • Categories: cs.LG, cs.AI, cs.NE, stat.ML
  • Published: December 4, 2025