[Paper] CyberGFM: Graph Foundation Models for Lateral Movement Detection in Enterprise Networks
Source: arXiv - 2601.05988v1
Overview
The paper introduces CyberGFM, a novel “graph foundation model” that treats network traffic as a language. By feeding random‑walk “sentences” from an enterprise network into a transformer‑based model, the authors achieve state‑of‑the‑art lateral‑movement detection while keeping training costs low enough for practical use.
Key Contributions
- Transformer‑based graph foundation model that learns from random walks, merging the speed of skip‑gram methods with the expressive power of deep language models.
- Efficient training pipeline that runs on commodity GPUs, avoiding the massive memory footprints of traditional GNNs.
- Unified unsupervised link‑prediction framework for anomaly detection, requiring only benign traffic for pre‑training.
- Empirical superiority: 30–38% higher average precision on three benchmark network‑anomaly datasets compared with prior GNN and random‑walk baselines, at the same model size.
- Open‑source‑ready design: the authors release code and pretrained checkpoints, enabling rapid adoption in security tooling.
Methodology
- Graph Construction – Each host, service, or IP in the enterprise network becomes a node; edges represent observed benign connections (e.g., TCP flows). Edge attributes (port, protocol, timestamps) are stored but not directly fed to the random‑walk generator.
- Random‑Walk Tokenization – The graph is traversed with biased random walks (similar to Word2Vec’s “sentences”). Each walk is a sequence of node IDs, optionally interleaved with edge‑type tokens, producing a textual‑like corpus.
- Pre‑training with a Transformer – A standard decoder‑only transformer (e.g., GPT‑2 style) is trained autoregressively to predict the next node in each walk, learning contextual embeddings for nodes and edges. Because transformers are heavily optimized for GPUs, pre‑training finishes in minutes on a single 16 GB GPU.
- Fine‑tuning for Link Prediction – The pretrained model is then fine‑tuned on a binary link‑prediction task: given a pair of nodes, predict whether an edge should exist. No labeled attacks are required; the model learns the “normal” connectivity pattern.
- Anomaly Scoring – At inference time, each observed connection is scored by the model’s predicted probability. Low probabilities indicate anomalous lateral movement (e.g., a compromised host contacting an unusual server).
The pipeline is fully unsupervised: only benign traffic is needed for both pre‑training and fine‑tuning, making it suitable for environments where attack data is scarce.
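To make the walk‑generation step concrete, here is a minimal sketch over a toy adjacency list. The `biased_random_walks` helper and its `return_bias` parameter are hypothetical stand‑ins for the paper's node2vec‑style biased walks (which use p/q parameters), not the authors' code; each returned walk is one "sentence" of node IDs for the transformer corpus.

```python
import random

def biased_random_walks(adj, walk_len=8, walks_per_node=2, return_bias=0.5, seed=7):
    """Generate crude biased walks over an adjacency dict.

    `adj` maps node -> list of neighbors; `return_bias` is the probability
    of stepping back to the previous node (a simplified stand-in for the
    p/q biases of node2vec-style walks).
    """
    rng = random.Random(seed)
    corpus = []
    for start in adj:
        for _ in range(walks_per_node):
            walk, prev = [start], None
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: emit the partial walk
                if prev in nbrs and rng.random() < return_bias:
                    nxt = prev  # biased step back toward the previous node
                else:
                    nxt = rng.choice(nbrs)
                prev, walk = walk[-1], walk + [nxt]
            corpus.append(walk)
    return corpus

# Toy enterprise graph: hosts connect to the servers they normally talk to.
graph = {
    "workstation-1": ["dc-1", "file-srv"],
    "workstation-2": ["dc-1"],
    "dc-1": ["workstation-1", "workstation-2", "file-srv"],
    "file-srv": ["dc-1", "workstation-1"],
}
sentences = biased_random_walks(graph)
print(len(sentences), sentences[0])
```

Each walk starts at its source node and only traverses observed edges, so the resulting corpus never contains a connection that benign traffic did not exhibit.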
Results & Findings
| Dataset | Prior Best AP | CyberGFM AP | Relative Gain |
|---|---|---|---|
| CIC‑IDS‑2017 (network flow) | 0.71 | 0.92 | +30% |
| LANL‑Cyber (auth logs) | 0.64 | 0.88 | +38% |
| UNSW‑NB15 (synthetic) | 0.68 | 0.91 | +34% |
- Training time: ~30 min on a single RTX 3090 vs. >4 h for comparable GNNs.
- Memory usage: <8 GB GPU RAM, whereas GNNs often exceed 16 GB.
- Parameter count: Same as a 12‑layer transformer (~100 M), matching the size of the best prior GNN baseline.
These numbers demonstrate that CyberGFM not only improves detection quality but also reduces operational overhead.
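For reference, the AP metric in the table above is standard average precision over ranked anomaly scores. The sketch below implements that textbook definition (not code from the paper): rank connections by anomaly score and average the precision at each true‑positive rank.

```python
def average_precision(scores, labels):
    """Average precision for ranked anomaly scores.

    scores: anomaly score per connection (higher = more anomalous).
    labels: 1 if the connection is a true attack, else 0.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total, ap = 0, sum(labels), 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank  # precision at this true-positive rank
    return ap / total

# Two attacks ranked 1st and 3rd out of four connections.
print(average_precision([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

A perfect ranking (all attacks ahead of all benign connections) yields AP = 1.0, which is why the 0.88–0.92 scores in the table indicate near‑ideal ranking of attack traffic.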
Practical Implications
- Fast deployment – Security teams can train a model on their own benign traffic in under an hour on a single commodity GPU, then start flagging suspicious lateral moves immediately.
- Scalable to large enterprises – Because the approach relies on random walks rather than full adjacency matrices, it scales linearly with the number of observed connections.
- Integration with existing SIEMs – The model outputs a simple probability score per connection, which can be ingested as a new alert type or fed into a risk‑scoring engine.
- Zero‑label anomaly detection – No need to curate attack datasets; the system learns “normal” behavior from the environment itself, reducing the risk of bias.
- Extensible to other graph‑based security problems – The same foundation model can be fine‑tuned for privilege‑escalation detection, insider threat identification, or even for network topology inference.
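One possible SIEM integration is a thin thresholding shim over the model's per‑connection probability. The `Alert` shape and the thresholds below are illustrative assumptions, not part of the paper; a deployment would calibrate thresholds on a held‑out slice of benign traffic.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    src: str
    dst: str
    score: float   # model's predicted link probability; low = anomalous
    severity: str

def to_alerts(scored_connections, low=0.05, mid=0.20):
    """Map link-prediction probabilities to SIEM-style alerts.

    Connections below `low` become high-severity alerts, those below
    `mid` become medium-severity; everything else is treated as normal.
    """
    alerts = []
    for src, dst, p in scored_connections:
        if p < low:
            alerts.append(Alert(src, dst, p, "high"))
        elif p < mid:
            alerts.append(Alert(src, dst, p, "medium"))
    return alerts

# Hypothetical model scores for three observed connections.
scored = [
    ("workstation-1", "file-srv", 0.93),    # normal traffic, no alert
    ("workstation-2", "hr-db", 0.12),       # unusual destination
    ("workstation-2", "backup-srv", 0.01),  # likely lateral movement
]
for a in to_alerts(scored):
    print(f"{a.severity}: {a.src} -> {a.dst} (p={a.score:.2f})")
```

Because the output is a flat per‑connection record, it can be serialized to JSON and ingested by any risk‑scoring engine without changes to the detection model itself.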
Limitations & Future Work
- Edge‑feature utilization – While the random‑walk corpus can embed edge types, richer continuous attributes (e.g., byte counts, latency) are not directly modeled; future work could fuse tokenized walks with auxiliary feature encoders.
- Temporal dynamics – The current model treats walks as static sentences; incorporating explicit time‑aware attention could improve detection of fast‑moving attacks.
- Evaluation on live production traffic – Benchmarks are based on public datasets; real‑world deployments may encounter noisy or incomplete logs that affect walk quality.
- Model interpretability – Like most transformer‑based detectors, explaining why a specific connection is flagged remains challenging; adding attention‑visualization tools is a promising direction.
Overall, CyberGFM showcases how modern language‑model techniques can be repurposed for network security, delivering both higher detection performance and practical efficiency for developers and security engineers.
Authors
- Isaiah J. King
- Bernardo Trindade
- Benjamin Bowman
- H. Howie Huang
Paper Information
- arXiv ID: 2601.05988v1
- Categories: cs.CR, cs.LG
- Published: January 9, 2026