[Paper] RPG-AE: Neuro-Symbolic Graph Autoencoders with Rare Pattern Mining for Provenance-Based Anomaly Detection
Source: arXiv - 2602.02929v1
Overview
The paper introduces RPG‑AE, a hybrid “neuro‑symbolic” system that blends deep graph representation learning with classic rare‑pattern mining to spot Advanced Persistent Threats (APTs) in system‑level provenance logs. By turning process interactions into a graph, learning its normal structure with a Graph Autoencoder (GAE), and then amplifying suspicious signals with infrequent behavior patterns, the authors achieve state‑of‑the‑art detection on the DARPA Transparent Computing benchmark.
Key Contributions
- Neuro‑symbolic architecture: Combines a Graph Autoencoder (deep learning) with a rare‑pattern mining module (symbolic AI) in a single pipeline.
- k‑NN‑based provenance graph construction: Builds a process‑behavior graph using feature similarity, preserving both temporal and relational context.
- Anomaly scoring via reconstruction error + rarity boost: Detects deviations from the learned normal graph and raises the score for processes exhibiting rarely seen co‑occurrences.
- Comprehensive evaluation: Shows significant improvements over a pure GAE baseline and competitive results against ensembles of multiple unsupervised detectors on the DARPA TC dataset.
- Interpretability hook: The rare‑pattern component provides human‑readable signatures that explain why a process is flagged, bridging the “black‑box” gap of deep models.
Methodology
- Data preprocessing – System‑level provenance events (e.g., file reads, network sockets) are encoded into a feature vector per process (CPU usage, I/O counts, syscall frequencies, etc.).
- Graph construction – For each time window, a k‑Nearest Neighbors (k‑NN) graph is built where nodes are processes and edges connect the k most similar processes based on the feature vectors. This captures “who behaves like whom.”
- Graph Autoencoder (GAE) – A two‑layer Graph Convolutional Network (GCN) encoder compresses each node’s neighborhood into a low‑dimensional embedding; a decoder attempts to reconstruct the adjacency matrix. The reconstruction loss measures how well the model captures the normal relational structure.
- Rare‑pattern mining – Independently, the system mines infrequent sub‑graphs (e.g., a specific combination of file accesses and network calls that appears in fewer than 1% of windows) using a classic frequent‑itemset algorithm adapted for graphs.
- Anomaly scoring – For a given process, the final score = GAE reconstruction error + rarity boost (if the process participates in a mined rare pattern). The boost is calibrated so that truly anomalous rare patterns outweigh benign noise.
- Ranking & alerting – Processes are ranked by their composite scores; top‑k are presented to analysts.
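The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`knn_graph`, `mine_rare_patterns`, `anomaly_scores`), the Euclidean distance, the pairwise event co‑occurrence used as a stand‑in for sub‑graph mining, the `max_support` and `boost` values, and the toy data are all hypothetical. The GAE itself is not implemented here; its per‑process reconstruction errors are passed in as given numbers.

```python
import math
from collections import Counter
from itertools import combinations


def knn_graph(features, k):
    """Link each process (node) to its k most similar processes.

    Similarity is Euclidean distance here, which is an assumption.
    """
    graph = {}
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        graph[i] = {j for _, j in dists[:k]}
    return graph


def mine_rare_patterns(event_sets, max_support, size=2):
    """Return event combinations whose support is at most max_support.

    Simplification: co-occurring event types stand in for sub-graphs.
    """
    counts = Counter()
    for events in event_sets:
        counts.update(combinations(sorted(events), size))
    n = len(event_sets)
    return {p for p, c in counts.items() if c / n <= max_support}


def anomaly_scores(recon_error, event_sets, rare_patterns, boost):
    """Score = reconstruction error + boost if the process matches a rare pattern."""
    scores = []
    for err, events in zip(recon_error, event_sets):
        hit = any(set(p) <= events for p in rare_patterns)
        scores.append(err + (boost if hit else 0.0))
    return scores


# Toy data (hypothetical): three similar processes and one outlier.
features = [[0.10, 0.20], [0.15, 0.22], [0.12, 0.18], [0.90, 0.95]]
event_sets = [
    {"file_read", "net_connect"},
    {"file_read", "net_connect"},
    {"file_read", "net_connect"},
    {"dns_query", "priv_write"},
]
# Placeholder values standing in for the output of a trained GAE.
recon_error = [0.05, 0.04, 0.06, 0.30]

graph = knn_graph(features, k=2)
rare = mine_rare_patterns(event_sets, max_support=0.25)
scores = anomaly_scores(recon_error, event_sets, rare, boost=0.5)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy run the fourth process is the only one that matches a rare pattern, so the boost pushes it further ahead of the others in the ranking. In a real deployment, `recon_error` would come from the trained autoencoder over the k‑NN graph rather than from fixed numbers.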
Results & Findings
| Metric | GAE only | RPG‑AE (GAE + rare boost) | Best prior unsupervised method |
|---|---|---|---|
| AUROC (higher is better) | 0.84 | 0.92 | 0.88 |
| AUPRC (higher is better) | 0.31 | 0.48 | 0.42 |
| Mean rank of APT events (lower is better) | 57 | 22 | 35 |
- Rare‑pattern boosting lowers the mean rank of true APT processes from 57 to 22, an improvement of roughly 60% relative to the baseline GAE.
- The single RPG‑AE model matches or exceeds ensemble approaches that combine 3–4 separate detectors, while requiring far less engineering overhead.
- Qualitative analysis shows that many high‑scoring alerts correspond to known APT tactics (e.g., lateral movement via uncommon IPC channels), confirming the interpretability benefit.
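The relative gains can be checked directly from the table values. The snippet below is a small arithmetic sketch; the helper name `relative_change` is hypothetical, and the numbers are copied from the table as reported in this summary.

```python
def relative_change(baseline, new):
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# (GAE only, RPG-AE) pairs taken from the results table.
rows = {
    "AUROC": (0.84, 0.92),
    "AUPRC": (0.31, 0.48),
    "Mean rank of APT events": (57, 22),
}

for name, (gae, rpg_ae) in rows.items():
    print(f"{name}: {relative_change(gae, rpg_ae):+.1%}")
```

The mean rank falls by about 61%, which is the source of the "roughly 60%" figure. For mean rank a negative change is the improvement, since lower ranks place true APT events nearer the top of the analyst's list.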
Practical Implications
- Plug‑and‑play anomaly detector: Security teams can deploy RPG‑AE as a drop‑in module on existing provenance collection pipelines (e.g., Sysdig, Falco, or OS‑level audit logs) without needing to train multiple specialized models.
- Reduced alert fatigue: By surfacing the rarest suspicious patterns, the system prioritizes alerts that are more likely to be true threats, helping SOC analysts focus on high‑value investigations.
- Explainable alerts: The rare‑pattern component supplies a concise “why” (e.g., “process X performed a rare combination of DNS query + privileged file write”), which can be directly fed into ticketing or automated response playbooks.
- Scalable to large environments: The k‑NN graph is built per sliding window, and the GAE scales linearly with node count; rare‑pattern mining can be throttled by adjusting support thresholds, making the approach viable for cloud‑native microservice clusters.
- Foundation for downstream defenses: The learned embeddings can be reused for threat hunting, lateral‑movement detection, or feeding into reinforcement‑learning based response agents.
Limitations & Future Work
- Dependence on quality of provenance data – Missing or noisy logs degrade both the graph structure and the rarity statistics.
- Static rarity thresholds – The current mining step uses a fixed support cutoff; adaptive thresholds could better handle evolving baselines.
- Temporal granularity – The method processes windows independently, which may miss multi‑window attack chains; incorporating recurrent or temporal GNNs is a promising direction.
- Evaluation limited to DARPA TC – While the benchmark is rigorous, broader validation on real‑world enterprise telemetry (e.g., from Microsoft Azure or Google Cloud environments) would strengthen claims of generality.
Bottom line: RPG‑AE demonstrates that marrying deep graph learning with classic pattern mining yields a more accurate, interpretable, and operationally friendly solution for provenance‑based APT detection. Developers and security engineers can start experimenting with the approach today.
Authors
- Asif Tauhid
- Sidahmed Benabderrahmane
- Mohamad Altrabulsi
- Ahamed Foisal
- Talal Rahwan
Paper Information
- arXiv ID: 2602.02929v1
- Categories: cs.LG, cs.AI, cs.CR, cs.NE
- Published: February 3, 2026