[Paper] RPG-AE: Neuro-Symbolic Graph Autoencoders with Rare Pattern Mining for Provenance-Based Anomaly Detection

Published: February 2, 2026 at 07:02 PM EST
Source: arXiv - 2602.02929v1

Overview

The paper introduces RPG‑AE, a hybrid “neuro‑symbolic” system that blends deep graph representation learning with classic rare‑pattern mining to spot Advanced Persistent Threats (APTs) in system‑level provenance logs. By turning process interactions into a graph, learning its normal structure with a Graph Autoencoder (GAE), and then amplifying suspicious signals with infrequent behavior patterns, the authors achieve state‑of‑the‑art detection on the DARPA Transparent Computing benchmark.

Key Contributions

  • Neuro‑symbolic architecture: Combines a Graph Autoencoder (deep learning) with a rare‑pattern mining module (symbolic AI) in a single pipeline.
  • k‑NN‑based provenance graph construction: Builds a process‑behavior graph using feature similarity, preserving both temporal and relational context.
  • Anomaly scoring via reconstruction error + rarity boost: Detects deviations from the learned normal graph and raises the score for processes exhibiting rarely seen co‑occurrences.
  • Comprehensive evaluation: Shows significant improvements over a pure GAE baseline and competitive results against ensembles of multiple unsupervised detectors on the DARPA TC dataset.
  • Interpretability hook: The rare‑pattern component provides human‑readable signatures that explain why a process is flagged, bridging the “black‑box” gap of deep models.
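
One way to write down the scoring idea from the contributions above is the following sketch. It assumes the standard inner-product decoder commonly used with graph autoencoders and a fixed additive boost; the symbols $\theta$, $\lambda$, $\mathcal{R}$ and the squared-error form of $e(v)$ are illustrative assumptions, not notation confirmed by the paper.

```latex
\begin{align}
Z &= \mathrm{GCN}_{\theta}(X, A)
  && \text{(node embeddings from features } X \text{ and adjacency } A\text{)} \\
\hat{A} &= \sigma\!\left(Z Z^{\top}\right)
  && \text{(reconstructed adjacency)} \\
e(v) &= \lVert A_{v,:} - \hat{A}_{v,:} \rVert_2^2
  && \text{(reconstruction error of node } v\text{)} \\
s(v) &= e(v) + \lambda \, \mathbb{1}\!\left[\exists\, p \in \mathcal{R} : v \text{ participates in } p\right]
  && \text{(composite anomaly score)}
\end{align}
```

Here $\mathcal{R}$ stands for the set of mined rare patterns and $\lambda > 0$ for the size of the rarity boost. A process that reconstructs poorly and also matches a rare pattern receives the highest score.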

Methodology

  1. Data preprocessing – System‑level provenance events (e.g., file reads, network sockets) are encoded into a feature vector per process (CPU usage, I/O counts, syscall frequencies, etc.).
  2. Graph construction – For each time window, a k‑Nearest Neighbors (k‑NN) graph is built where nodes are processes and edges connect the k most similar processes based on the feature vectors. This captures “who behaves like whom.”
  3. Graph Autoencoder (GAE) – A two‑layer Graph Convolutional Network (GCN) encoder compresses each node’s neighborhood into a low‑dimensional embedding; a decoder attempts to reconstruct the adjacency matrix. The reconstruction loss measures how well the model captures the normal relational structure.
  4. Rare‑pattern mining – Independently, the system mines infrequent sub‑graphs (e.g., a specific combination of file accesses and network calls that appears in < 1 % of windows) using a classic frequent‑itemset algorithm adapted for graphs.
  5. Anomaly scoring – For a given process, the final score = GAE reconstruction error + rarity boost (if the process participates in a mined rare pattern). The boost is calibrated so that truly anomalous rare patterns outweigh benign noise.
  6. Ranking & alerting – Processes are ranked by their composite scores; top‑k are presented to analysts.
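
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it uses Euclidean distance for the k‑NN graph, replaces the graph‑adapted miner with plain itemset counting over per‑process event sets, and takes the GAE reconstruction errors as given inputs. All function names, the example processes, the event labels, the `boost` value, and the error values are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def knn_graph(features, k):
    """Map each process to the set of its k nearest processes (Euclidean distance)."""
    graph = {}
    for name, vec in features.items():
        dists = sorted(
            (math.dist(vec, other_vec), other)
            for other, other_vec in features.items()
            if other != name
        )
        graph[name] = {other for _, other in dists[:k]}
    return graph


def rare_patterns(transactions, max_support, size=2):
    """Return itemsets of the given size whose support is at or below max_support."""
    counts = Counter()
    for items in transactions.values():
        for combo in combinations(sorted(items), size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items() if c / n <= max_support}


def composite_scores(recon_error, transactions, rare, boost=1.0):
    """Add a fixed boost to the error of any process whose events contain a rare itemset."""
    scores = {}
    for proc, err in recon_error.items():
        hit = any(set(combo) <= transactions[proc] for combo in rare)
        scores[proc] = err + (boost if hit else 0.0)
    return scores


# Hypothetical per-process feature vectors and event sets for one time window.
features = {
    "sshd": (0.10, 0.20),
    "nginx": (0.20, 0.10),
    "cron": (0.15, 0.25),
    "dropper": (5.00, 4.00),
}
transactions = {
    "sshd": {"file_read", "net_connect"},
    "nginx": {"file_read", "net_connect"},
    "cron": {"file_read", "net_connect"},
    "dropper": {"dns_query", "priv_file_write"},
}
# Placeholder reconstruction errors; a trained GAE would supply these.
recon_error = {"sshd": 0.10, "nginx": 0.12, "cron": 0.11, "dropper": 0.60}

graph = knn_graph(features, k=2)
rare = rare_patterns(transactions, max_support=0.25)
scores = composite_scores(recon_error, transactions, rare, boost=1.0)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])
```

In this toy window only the `dropper` process matches a rare itemset, so it receives the boost on top of its already higher reconstruction error and ends up first in the ranking.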

Results & Findings

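| Metric | GAE only | RPG‑AE (GAE + rare boost) | Best prior unsupervised method |
|---|---|---|---|
| AUROC (higher is better) | 0.84 | 0.92 | 0.88 |
| AUPRC (higher is better) | 0.31 | 0.48 | 0.42 |
| Mean rank of APT events (lower is better) | 57 | 22 | 35 |

  • Rare‑pattern boosting improves the mean rank of true APT processes from 57 to 22, roughly a 60 % improvement over the baseline GAE.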
  • The single RPG‑AE model matches or exceeds ensemble approaches that combine 3–4 separate detectors, while requiring far less engineering overhead.
  • Qualitative analysis shows that many high‑scoring alerts correspond to known APT tactics (e.g., lateral movement via uncommon IPC channels), confirming the interpretability benefit.
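
The roughly 60 % figure follows directly from the mean‑rank numbers reported above (57 for the GAE baseline, 22 for RPG‑AE). The sketch below shows that arithmetic, along with one plausible way a mean‑rank metric could be computed; the helper names and the toy scores are hypothetical, and the paper may define the metric differently.

```python
def mean_rank(scores, positives):
    """Average 1-based rank of the positive items, sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = [ordered.index(p) + 1 for p in positives]
    return sum(ranks) / len(ranks)


def relative_improvement(baseline, improved):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - improved) / baseline


# Mean ranks taken from the results table: GAE only vs. RPG-AE.
gain = relative_improvement(57, 22)
print(f"{gain:.1%}")  # 61.4%

# Toy illustration of the metric itself (hypothetical scores).
toy_scores = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.1}
print(mean_rank(toy_scores, ["c", "d"]))  # 3.0
```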

Practical Implications

  • Plug‑and‑play anomaly detector: Security teams can deploy RPG‑AE as a drop‑in module on existing provenance collection pipelines (e.g., Sysdig, Falco, or OS‑level audit logs) without needing to train multiple specialized models.
  • Reduced alert fatigue: By surfacing the rarest suspicious patterns, the system prioritizes alerts that are more likely to be true threats, helping SOC analysts focus on high‑value investigations.
  • Explainable alerts: The rare‑pattern component supplies a concise “why” (e.g., “process X performed a rare combination of DNS query + privileged file write”), which can be directly fed into ticketing or automated response playbooks.
  • Scalable to large environments: The k‑NN graph is built per sliding window, and the GAE scales linearly with node count; rare‑pattern mining can be throttled by adjusting support thresholds, making the approach viable for cloud‑native microservice clusters.
  • Foundation for downstream defenses: The learned embeddings can be reused for threat hunting, lateral‑movement detection, or feeding into reinforcement‑learning based response agents.
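
As an illustration of the explainable‑alerts point above, a matched rare pattern could be rendered into a short message for a ticketing system. This is only a sketch: the function name, the message format, and the idea of passing each pattern as an `(items, support)` pair are assumptions made for the example, not an interface described in the paper.

```python
def explain_alert(process, score, matched_patterns):
    """Build a one-line explanation from (items, support) pairs for a flagged process."""
    header = f"process {process} (score {score:.2f})"
    if not matched_patterns:
        return f"{header}: no rare pattern matched; flagged on reconstruction error alone"
    reasons = [
        f"rare combination of {' + '.join(items)} (seen in {support:.1%} of windows)"
        for items, support in matched_patterns
    ]
    return f"{header}: " + "; ".join(reasons)


# Hypothetical alert: one rare itemset with 0.4% support.
message = explain_alert("X", 1.6, [(("dns_query", "priv_file_write"), 0.004)])
print(message)
```

Keeping the pattern items and their support in the message lets an analyst see at a glance whether the alert was driven by rarity or by reconstruction error alone.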

Limitations & Future Work

  • Dependence on quality of provenance data – Missing or noisy logs degrade both the graph structure and the rarity statistics.
  • Static rarity thresholds – The current mining step uses a fixed support cutoff; adaptive thresholds could better handle evolving baselines.
  • Temporal granularity – The method processes windows independently, which may miss multi‑window attack chains; incorporating recurrent or temporal GNNs is a promising direction.
  • Evaluation limited to DARPA TC – While the benchmark is rigorous, broader validation on real‑world enterprise datasets (e.g., Microsoft Azure, Google Cloud) would strengthen claims of generality.

Bottom line: RPG‑AE demonstrates that marrying deep graph learning with classic pattern mining yields a more accurate, interpretable, and operationally friendly solution for provenance‑based APT detection—an approach that developers and security engineers can start experimenting with today.

Authors

  • Asif Tauhid
  • Sidahmed Benabderrahmane
  • Mohamad Altrabulsi
  • Ahamed Foisal
  • Talal Rahwan

Paper Information

  • arXiv ID: 2602.02929v1
  • Categories: cs.LG, cs.AI, cs.CR, cs.NE
  • Published: February 3, 2026