[Paper] RPG-AE: Neuro-Symbolic Graph Autoencoders with Rare Pattern Mining for Provenance-Based Anomaly Detection

Published: February 2, 2026 at 07:02 PM EST

Source: arXiv - 2602.02929v1

Overview

The paper introduces RPG‑AE, a hybrid “neuro‑symbolic” system that blends deep graph representation learning with classic rare‑pattern mining to spot Advanced Persistent Threats (APTs) in system‑level provenance logs. By turning process interactions into a graph, learning its normal structure with a Graph Autoencoder (GAE), and then amplifying suspicious signals with infrequent behavior patterns, the authors achieve state‑of‑the‑art detection on the DARPA Transparent Computing benchmark.

Key Contributions

Neuro‑symbolic architecture: Combines a Graph Autoencoder (deep learning) with a rare‑pattern mining module (symbolic AI) in a single pipeline.
k‑NN‑based provenance graph construction: Builds a process‑behavior graph using feature similarity, preserving both temporal and relational context.
Anomaly scoring via reconstruction error + rarity boost: Detects deviations from the learned normal graph and raises the score for processes exhibiting rarely seen co‑occurrences.
Comprehensive evaluation: Shows significant improvements over a pure GAE baseline and competitive results against ensembles of multiple unsupervised detectors on the DARPA TC dataset.
Interpretability hook: The rare‑pattern component provides human‑readable signatures that explain why a process is flagged, bridging the “black‑box” gap of deep models.

Methodology

Data preprocessing – System‑level provenance events (e.g., file reads, network sockets) are encoded into a feature vector per process (CPU usage, I/O counts, syscall frequencies, etc.).
Graph construction – For each time window, a k‑Nearest Neighbors (k‑NN) graph is built where nodes are processes and edges connect the k most similar processes based on the feature vectors. This captures “who behaves like whom.”
Graph Autoencoder (GAE) – A two‑layer Graph Convolutional Network (GCN) encoder compresses each node’s neighborhood into a low‑dimensional embedding; a decoder attempts to reconstruct the adjacency matrix. The reconstruction loss measures how well the model captures the normal relational structure.
Rare‑pattern mining – Independently, the system mines infrequent sub‑graphs (e.g., a specific combination of file accesses and network calls that appears in < 1 % of windows) using a classic frequent‑itemset algorithm adapted for graphs.
Anomaly scoring – For a given process, the final score = GAE reconstruction error + rarity boost (if the process participates in a mined rare pattern). The boost is calibrated so that truly anomalous rare patterns outweigh benign noise.
Ranking & alerting – Processes are ranked by their composite scores; top‑k are presented to analysts.
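
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance metric, the inner-product decoder, the simplification of sub-graph mining to pairs of event types per process, and the names `knn_graph`, `reconstruction_error`, `rare_itemsets`, `composite_scores`, `max_support`, and `boost` are all hypothetical choices made for the example. The embeddings are taken as an input; in the real pipeline they would come from the trained GCN encoder.

```python
from collections import Counter
from itertools import combinations

import numpy as np


def knn_graph(features, k):
    """Symmetric k-NN adjacency over processes (Euclidean distance assumed)."""
    n = features.shape[0]
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a process is never its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbors.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make edges undirected


def reconstruction_error(adj, embeddings):
    """Per-node squared error of an assumed inner-product decoder."""
    logits = embeddings @ embeddings.T
    recon = 1.0 / (1.0 + np.exp(-logits))  # sigmoid(Z Z^T)
    return ((adj - recon) ** 2).mean(axis=1)


def rare_itemsets(event_sets, max_support=0.01, size=2):
    """Event-type combinations seen in fewer than `max_support` of processes."""
    counts = Counter()
    for events in event_sets:
        counts.update(combinations(sorted(events), size))
    n = len(event_sets)
    return {items for items, c in counts.items() if c / n < max_support}


def composite_scores(recon_err, event_sets, rare, boost=1.0):
    """Reconstruction error plus a fixed boost for processes matching a rare pattern."""
    hits = np.array(
        [any(set(items) <= events for items in rare) for events in event_sets],
        dtype=float,
    )
    return recon_err + boost * hits


if __name__ == "__main__":
    # Toy data: four processes with two features each (values are made up).
    feats = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
    events = [
        {"read", "connect"},
        {"read", "connect"},
        {"read", "connect"},
        {"dns_query", "priv_write"},
    ]
    adj = knn_graph(feats, k=1)
    # Zero embeddings stand in for the output of a trained encoder.
    err = reconstruction_error(adj, np.zeros((4, 2)))
    rare = rare_itemsets(events, max_support=0.3)
    scores = composite_scores(err, events, rare, boost=0.5)
    ranking = np.argsort(-scores)  # highest score first
    print(scores, ranking)
```

In this toy run only the last process matches a rare pattern, so it receives the boost and ranks first. How the boost is calibrated against the reconstruction error is not specified in this summary, so the constant used here is purely illustrative.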

Results & Findings

Metric	GAE only	RPG‑AE (GAE + rare boost)	Best prior unsupervised method
AUROC (higher is better)	0.84	0.92	0.88
AUPRC (higher is better)	0.31	0.48	0.42
Mean rank of APT events (lower is better)	57	22	35

Rare‑pattern boosting improves the ranking of true APT processes by ~60 % relative to the baseline GAE.
The single RPG‑AE model matches or exceeds ensemble approaches that combine 3–4 separate detectors, while requiring far less engineering overhead.
Qualitative analysis shows that many high‑scoring alerts correspond to known APT tactics (e.g., lateral movement via uncommon IPC channels), confirming the interpretability benefit.
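
The ~60 % figure is consistent with the mean-rank row of the table, assuming it is measured as the relative reduction in mean rank of APT events from the GAE baseline to RPG‑AE (this reading is an interpretation; the summary does not state the formula):

```latex
\[
\frac{\text{rank}_{\text{GAE}} - \text{rank}_{\text{RPG-AE}}}{\text{rank}_{\text{GAE}}}
= \frac{57 - 22}{57}
= \frac{35}{57}
\approx 0.61
\]
```

By the same measure, the AUPRC gain is (0.48 - 0.31) / 0.31 ≈ 0.55 and the AUROC gain is (0.92 - 0.84) / 0.84 ≈ 0.10, so mean rank is the only row that matches the quoted figure.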

Practical Implications

Plug‑and‑play anomaly detector: Security teams can deploy RPG‑AE as a drop‑in module on existing provenance collection pipelines (e.g., Sysdig, Falco, or OS‑level audit logs) without needing to train multiple specialized models.
Reduced alert fatigue: By surfacing the rarest suspicious patterns, the system prioritizes alerts that are more likely to be true threats, helping SOC analysts focus on high‑value investigations.
Explainable alerts: The rare‑pattern component supplies a concise “why” (e.g., “process X performed a rare combination of DNS query + privileged file write”), which can be directly fed into ticketing or automated response playbooks.
Scalable to large environments: The k‑NN graph is built per sliding window, and the GAE scales linearly with node count; rare‑pattern mining can be throttled by adjusting support thresholds, making the approach viable for cloud‑native microservice clusters.
Foundation for downstream defenses: The learned embeddings can be reused for threat hunting, lateral‑movement detection, or feeding into reinforcement‑learning based response agents.

Limitations & Future Work

Dependence on quality of provenance data – Missing or noisy logs degrade both the graph structure and the rarity statistics.
Static rarity thresholds – The current mining step uses a fixed support cutoff; adaptive thresholds could better handle evolving baselines.
Temporal granularity – The method processes windows independently, which may miss multi‑window attack chains; incorporating recurrent or temporal GNNs is a promising direction.
Evaluation limited to DARPA TC – While the benchmark is rigorous, broader validation on real‑world enterprise datasets (e.g., Microsoft Azure, Google Cloud) would strengthen claims of generality.

Bottom line: RPG‑AE demonstrates that marrying deep graph learning with classic pattern mining yields a more accurate, interpretable, and operationally friendly solution for provenance‑based APT detection, one that developers and security engineers can start experimenting with today.

Authors

Asif Tauhid
Sidahmed Benabderrahmane
Mohamad Altrabulsi
Ahamed Foisal
Talal Rahwan

Paper Information

arXiv ID: 2602.02929v1
Categories: cs.LG, cs.AI, cs.CR, cs.NE
Published: February 3, 2026
