[Paper] ParaFormer: A Generalized PageRank Graph Transformer for Graph Representation Learning

Published: December 16, 2025 at 12:30 PM EST
3 min read
Source: arXiv - 2512.14619v1

Overview

The paper introduces ParaFormer, a novel Graph Transformer that tackles the notorious over‑smoothing problem that plagues both deep Graph Neural Networks (GNNs) and earlier Graph Transformers. By weaving a PageRank‑based attention mechanism into the transformer architecture, the authors achieve a model that preserves discriminative node features while still capturing long‑range dependencies—delivering consistent gains on a wide range of graph‑learning benchmarks.

Key Contributions

  • PageRank‑enhanced attention: A new attention formulation that mimics deep transformer behavior while acting as an adaptive pass‑filter to curb over‑smoothing.
  • Theoretical analysis: Formal proof that the proposed attention behaves like a low‑pass filter with controllable bandwidth, unlike vanilla global attention which collapses node representations.
  • Extensive empirical validation: State‑of‑the‑art results on 11 node‑ and graph‑classification datasets, spanning from a few thousand to several million nodes.
  • Open‑source implementation: Full codebase and reproducibility scripts released on GitHub, facilitating rapid adoption and further research.

Methodology

  1. Problem Diagnosis – The authors first show that global self‑attention in standard Graph Transformers behaves as an aggressive low‑pass filter, causing node embeddings to become indistinguishable (the over‑smoothing effect).
  2. PageRank‑Guided Attention – They replace the vanilla attention scores with a PageRank‑scaled version. Concretely, each node’s attention weight is multiplied by its personalized PageRank score, which reflects how “central” the node is relative to the query node (a minimal sketch follows this list).
  3. Adaptive Pass‑Filter Design – By adjusting the teleport (restart) probability in the PageRank computation, the model can smoothly transition between preserving high‑frequency (local) information and aggregating low‑frequency (global) context.
  4. Integration into Transformer Stack – The PageRank‑enhanced attention is plugged into a standard multi‑head transformer encoder, preserving all the benefits of transformer depth (e.g., expressive power, parallelism) without the need for many GNN layers.
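
To make steps 2 and 3 concrete, below is a minimal, dense‑matrix PyTorch sketch. The function names and the exact way the personalized PageRank (PPR) scores rescale the softmax weights are illustrative assumptions rather than the paper’s operator, and a real implementation would use sparse approximations for large graphs; the released code is the authoritative reference.

```python
import torch

def personalized_pagerank(adj: torch.Tensor, alpha: float = 0.15, iters: int = 10) -> torch.Tensor:
    """Approximate personalized PageRank by power iteration (illustrative).

    adj:   dense [N, N] adjacency matrix (assumption: small graph, dense ops).
    alpha: teleport / restart probability.
    Returns an [N, N] matrix whose row i holds the PPR scores of every node
    with respect to seed node i.
    """
    n = adj.size(0)
    deg = adj.sum(dim=1).clamp(min=1.0)
    trans = adj / deg.unsqueeze(1)                  # row-stochastic transition matrix
    restart = torch.eye(n)                          # each row restarts at its own seed
    ppr = restart.clone()
    for _ in range(iters):
        ppr = alpha * restart + (1 - alpha) * ppr @ trans
    return ppr

def pagerank_scaled_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                              ppr: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention whose weights are rescaled by PPR scores.

    One plausible reading of "attention weight multiplied by the personalized
    PageRank score"; not necessarily the paper's exact formulation.
    """
    scores = (q @ k.T) / q.size(-1) ** 0.5
    weights = torch.softmax(scores, dim=-1) * ppr   # damp attention to non-central nodes
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-12)
    return weights @ v
```

In this reading, lowering the teleport probability lets the walk spread further across the graph (more global, low‑frequency context), while raising it keeps mass near the seed node (more local, high‑frequency detail), which is the knob step 3 refers to.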

The overall pipeline remains familiar to developers:

input node features → linear projection → PageRank‑aware attention → feed‑forward network → (stack of such layers) → readout (node‑ or graph‑level)
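
A single encoder layer along those lines might look like the following sketch, which reuses pagerank_scaled_attention from the previous snippet (single head for brevity; the class name and layer layout are assumptions, not the released ParaFormer module):

```python
import torch
import torch.nn as nn

class PageRankAttentionLayer(nn.Module):
    """One projection + attention + FFN block in the spirit of the pipeline
    above (hypothetical naming, single attention head)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, ppr: torch.Tensor) -> torch.Tensor:
        # PageRank-aware attention sublayer with a residual connection.
        h = pagerank_scaled_attention(self.q_proj(x), self.k_proj(x), self.v_proj(x), ppr)
        x = self.norm1(x + h)
        # Position-wise feed-forward sublayer with a residual connection.
        return self.norm2(x + self.ffn(x))

# Stacking a few such layers and mean-pooling gives a graph-level readout:
#   for layer in layers: x = layer(x, ppr)
#   graph_repr = x.mean(dim=0)
```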

Results & Findings

| Task | Datasets (size) | Baselines (GNN/GT) | ParaFormer gain | Notes |
| --- | --- | --- | --- | --- |
| Node classification | Cora, PubMed, ogbn‑arxiv (up to 2M nodes) | GCN, GraphSAGE, vanilla Graph Transformer | +3.2% – +7.5% accuracy | Consistent across scales |
| Graph classification | MUTAG, PROTEINS, ZINC (up to 1M graphs) | GIN, Graphormer | +2.1% – +5.8% ROC‑AUC | Better handling of long‑range dependencies |

Key observations

  • ParaFormer’s performance gap widens on larger, sparser graphs where over‑smoothing is most harmful.
  • Ablation studies confirm that the PageRank scaling is the primary driver of improvement; removing it reverts performance to that of vanilla attention.
  • Sensitivity analysis shows the teleport probability can be tuned per dataset, but a default value (≈0.15) works well out‑of‑the‑box.

Practical Implications

  • Scalable Graph Learning – Developers can replace deep GNN stacks with a shallower ParaFormer encoder, reducing memory footprints while still capturing global context.
  • Robustness to Graph Size – Because the PageRank computation can be approximated with fast power‑iteration or personalized PageRank tricks, the model scales to millions of nodes without prohibitive overhead.
  • Better Feature Preservation – For applications like fraud detection, recommendation, or molecular property prediction where subtle node‑level differences matter, ParaFormer maintains discriminative embeddings where traditional transformers would blur them.
  • Plug‑and‑Play – The open‑source library provides a drop‑in PyTorch module; existing pipelines that already use Graph Transformers can swap in ParaFormerAttention with minimal code changes (a usage sketch follows this list).
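
As a usage pattern, the swap can look like the sketch below. The class name ParaFormerAttention comes from the post itself, but its real constructor arguments and forward signature are defined by the released library; SwappableEncoderLayer and VanillaGraphAttention are hypothetical scaffolding used only to illustrate the drop‑in idea.

```python
import torch
import torch.nn as nn

class SwappableEncoderLayer(nn.Module):
    """Encoder layer that receives its attention sublayer as a constructor
    argument, so a different attention module can be dropped in without
    touching the surrounding pipeline."""

    def __init__(self, dim: int, attention: nn.Module):
        super().__init__()
        self.attention = attention
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, graph) -> torch.Tensor:
        x = self.norm1(x + self.attention(x, graph))
        return self.norm2(x + self.ffn(x))

# Hypothetical before/after (check the library's README for the actual API):
#   layer = SwappableEncoderLayer(dim, VanillaGraphAttention(dim, heads))
#   layer = SwappableEncoderLayer(dim, ParaFormerAttention(dim, heads))
```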

Limitations & Future Work

  • PageRank Approximation Cost – Computing PageRank scores adds roughly O(|E|) work per power‑iteration step at each layer; even with the efficient approximations the authors use, extremely dynamic graphs (e.g., streaming edges) may still pose challenges.
  • Hyper‑parameter Sensitivity – The teleport probability and the number of power‑iteration steps need modest tuning for optimal performance on niche domains.
  • Theoretical Scope – The current analysis focuses on over‑smoothing; other transformer pathologies (e.g., attention collapse on highly regular graphs) remain unexplored.

Future directions hinted by the authors include extending the adaptive filter concept to heterogeneous graphs, integrating learned teleport probabilities, and exploring hardware‑aware optimizations for the PageRank step.

Authors

  • Chaohao Yuan
  • Zhenjie Song
  • Ercan Engin Kuruoglu
  • Kangfei Zhao
  • Yang Liu
  • Deli Zhao
  • Hong Cheng
  • Yu Rong

Paper Information

  • arXiv ID: 2512.14619v1
  • Categories: cs.LG
  • Published: December 16, 2025