[Paper] CyberGFM: Graph Foundation Models for Lateral Movement Detection in Enterprise Networks
Source: arXiv - 2601.05988v1
Overview
The paper introduces CyberGFM, a novel “graph foundation model” that treats network traffic as a language. By feeding random‑walk “sentences” from an enterprise network into a transformer‑based model, the authors achieve state‑of‑the‑art lateral‑movement detection while keeping training costs low enough for practical use.
Key Contributions
- Transformer‑based graph foundation model that learns from random walks, merging the speed of skip‑gram methods with the expressive power of deep language models.
- Efficient training pipeline that runs on commodity GPUs, avoiding the massive memory footprints of traditional GNNs.
- Unified unsupervised link‑prediction framework for anomaly detection, requiring only benign traffic for pre‑training.
- Empirical superiority: 30–38% higher average precision on three benchmark network‑anomaly datasets compared with prior GNN and random‑walk baselines, at the same model size.
- Open‑source‑ready design: the authors release code and pretrained checkpoints, enabling rapid adoption in security tooling.
Methodology
- Graph Construction – Each host, service, or IP in the enterprise network becomes a node; edges represent observed benign connections (e.g., TCP flows). Edge attributes (port, protocol, timestamps) are stored but not directly fed to the random‑walk generator.
- Random‑Walk Tokenization – The graph is traversed with biased random walks (similar to Word2Vec’s “sentences”). Each walk is a sequence of node IDs, optionally interleaved with edge‑type tokens, producing a textual‑like corpus.
- Pre‑training with a Transformer – A standard decoder‑only transformer (e.g., GPT‑2 style) is trained autoregressively to predict the next node in each walk, learning contextual embeddings for nodes and edges. Because transformers are heavily optimized for GPUs, pre‑training finishes in minutes on a single 16 GB GPU.
- Fine‑tuning for Link Prediction – The pretrained model is then fine‑tuned on a binary link‑prediction task: given a pair of nodes, predict whether an edge should exist. No labeled attacks are required; the model learns the “normal” connectivity pattern.
- Anomaly Scoring – At inference time, each observed connection is scored by the model’s predicted probability. Low probabilities indicate anomalous lateral movement (e.g., a compromised host contacting an unusual server).
The pipeline is fully unsupervised: only benign traffic is needed for both pre‑training and fine‑tuning, making it suitable for environments where attack data is scarce.
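To make the walk‑generation step concrete, here is a minimal sketch over a toy adjacency list. The `biased_random_walks` helper and its `return_bias` parameter are hypothetical stand‑ins for the paper's node2vec‑style biased walks (which use p/q parameters), not the authors' code; each returned walk is one "sentence" of node IDs for the transformer corpus.

```python
import random

def biased_random_walks(adj, walk_len=8, walks_per_node=2, return_bias=0.5, seed=7):
    """Generate crude biased walks over an adjacency dict.

    `adj` maps node -> list of neighbors; `return_bias` is the probability
    of stepping back to the previous node (a simplified stand-in for the
    p/q biases of node2vec-style walks).
    """
    rng = random.Random(seed)
    corpus = []
    for start in adj:
        for _ in range(walks_per_node):
            walk, prev = [start], None
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: emit the partial walk
                if prev in nbrs and rng.random() < return_bias:
                    nxt = prev  # biased step back toward the previous node
                else:
                    nxt = rng.choice(nbrs)
                prev, walk = walk[-1], walk + [nxt]
            corpus.append(walk)
    return corpus

# Toy enterprise graph: hosts connect to the servers they normally talk to.
graph = {
    "workstation-1": ["dc-1", "file-srv"],
    "workstation-2": ["dc-1"],
    "dc-1": ["workstation-1", "workstation-2", "file-srv"],
    "file-srv": ["dc-1", "workstation-1"],
}
sentences = biased_random_walks(graph)
print(len(sentences), sentences[0])
```

Each walk starts at its source node and only traverses observed edges, so the resulting corpus never contains a connection that benign traffic did not exhibit.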
Results & Findings
| Dataset | Prior Best AP | CyberGFM AP | Relative Gain |
|---|---|---|---|
| CIC‑IDS‑2017 (network flow) | 0.71 | 0.92 | +30% |
| LANL‑Cyber (auth logs) | 0.64 | 0.88 | +38% |
| UNSW‑NB15 (synthetic) | 0.68 | 0.91 | +34% |
- Training time: ~30 min on a single RTX 3090 vs. >4 h for comparable GNNs.
- Memory usage: <8 GB GPU RAM, whereas GNNs often exceed 16 GB.
- Parameter count: Same as a 12‑layer transformer (~100 M), matching the size of the best prior GNN baseline.
These numbers demonstrate that CyberGFM not only improves detection quality but also reduces operational overhead.
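For reference, the AP metric in the table above is standard average precision over ranked anomaly scores. The sketch below implements that textbook definition (not code from the paper): rank connections by anomaly score and average the precision at each true‑positive rank.

```python
def average_precision(scores, labels):
    """Average precision for ranked anomaly scores.

    scores: anomaly score per connection (higher = more anomalous).
    labels: 1 if the connection is a true attack, else 0.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total, ap = 0, sum(labels), 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank  # precision at this true-positive rank
    return ap / total

# Two attacks ranked 1st and 3rd out of four connections.
print(average_precision([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

A perfect ranking (all attacks ahead of all benign connections) yields AP = 1.0, which is why the 0.88–0.92 scores in the table indicate near‑ideal ranking of attack traffic.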
Practical Implications
- Fast deployment – Security teams can train a model on their own benign traffic in under an hour on a single commodity GPU, then start flagging suspicious lateral moves immediately.
- Scalable to large enterprises – Because the approach relies on random walks rather than full adjacency matrices, it scales linearly with the number of observed connections.
- Integration with existing SIEMs – The model outputs a simple probability score per connection, which can be ingested as a new alert type or fed into a risk‑scoring engine.
- Zero‑label anomaly detection – No need to curate attack datasets; the system learns “normal” behavior from the environment itself, reducing the risk of bias.
- Extensible to other graph‑based security problems – The same foundation model can be fine‑tuned for privilege‑escalation detection, insider threat identification, or even for network topology inference.
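One possible SIEM integration is a thin thresholding shim over the model's per‑connection probability. The `Alert` shape and the thresholds below are illustrative assumptions, not part of the paper; a deployment would calibrate thresholds on a held‑out slice of benign traffic.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    src: str
    dst: str
    score: float   # model's predicted link probability; low = anomalous
    severity: str

def to_alerts(scored_connections, low=0.05, mid=0.20):
    """Map link-prediction probabilities to SIEM-style alerts.

    Connections below `low` become high-severity alerts, those below
    `mid` become medium-severity; everything else is treated as normal.
    """
    alerts = []
    for src, dst, p in scored_connections:
        if p < low:
            alerts.append(Alert(src, dst, p, "high"))
        elif p < mid:
            alerts.append(Alert(src, dst, p, "medium"))
    return alerts

# Hypothetical model scores for three observed connections.
scored = [
    ("workstation-1", "file-srv", 0.93),    # normal traffic, no alert
    ("workstation-2", "hr-db", 0.12),       # unusual destination
    ("workstation-2", "backup-srv", 0.01),  # likely lateral movement
]
for a in to_alerts(scored):
    print(f"{a.severity}: {a.src} -> {a.dst} (p={a.score:.2f})")
```

Because the output is a flat per‑connection record, it can be serialized to JSON and ingested by any risk‑scoring engine without changes to the detection model itself.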
Limitations & Future Work
- Edge‑feature utilization – While the random‑walk corpus can embed edge types, richer continuous attributes (e.g., byte counts, latency) are not directly modeled; future work could fuse tokenized walks with auxiliary feature encoders.
- Temporal dynamics – The current model treats walks as static sentences; incorporating explicit time‑aware attention could improve detection of fast‑moving attacks.
- Evaluation on live production traffic – Benchmarks are based on public datasets; real‑world deployments may encounter noisy or incomplete logs that affect walk quality.
- Model interpretability – Like most transformer‑based detectors, explaining why a specific connection is flagged remains challenging; adding attention‑visualization tools is a promising direction.
Overall, CyberGFM showcases how modern language‑model techniques can be repurposed for network security, delivering both higher detection performance and practical efficiency for developers and security engineers.
Authors
- Isaiah J. King
- Bernardo Trindade
- Benjamin Bowman
- H. Howie Huang
Paper Information
- arXiv ID: 2601.05988v1
- Categories: cs.CR, cs.LG
- Published: January 9, 2026