[Paper] Democratizing ML for Enterprise Security: A Self-Sustained Attack Detection Framework

Published: December 9, 2025 at 11:58 AM EST
3 min read
Source: arXiv - 2512.08802v1

Overview

The paper presents a two‑stage, hybrid threat‑detection framework that blends loose YARA rules with a machine‑learning (ML) classifier, aiming to make advanced ML‑driven security affordable for any enterprise. By automatically generating synthetic training data and continuously learning from analyst feedback, the system keeps detection rules fresh while dramatically cutting false alarms.

Key Contributions

  • Hybrid detection pipeline: Coarse‑grained YARA filtering followed by a fine‑grained ML classifier to balance recall and precision.
  • Synthetic data generation with Simula: Enables analysts to create high‑quality training sets without needing large, labeled security datasets.
  • Active‑learning feedback loop: Real‑time analyst verdicts are fed back to the model, preventing rule decay and continuously improving precision.
  • Production‑scale validation: Deployed on tens of thousands of endpoints, processing up to 250 B raw events per day and delivering only a few tickets daily.
  • Low‑maintenance design: Minimal data‑science expertise required; security teams act as “teachers” rather than model developers.

Methodology

  1. Stage 1 – Loose YARA Rules

    • Analysts write permissive YARA signatures that aim for high recall (catch as many potential threats as possible).
    • These rules act as a fast, lightweight filter on massive log streams, reducing the data volume dramatically.
  2. Stage 2 – ML Classifier

    • The filtered events become inputs to a supervised classifier (e.g., gradient‑boosted trees).
    • Training data are produced by Simula, a synthetic data generator that mimics realistic attack patterns based on analyst‑provided “seed” behaviors.
    • The classifier learns to distinguish true threats from the noisy output of the YARA stage (the first sketch after this list illustrates the two‑stage flow).
  3. Active Learning Loop

    • When analysts investigate a ticket, their decision (malicious / benign) is automatically logged.
    • These labels are fed back to retrain the classifier on a regular schedule, allowing the model to adapt to emerging tactics and to correct drift in the YARA rules (see the retraining sketch after this list).
  4. Deployment Architecture

    • Stream processing (e.g., Apache Flink/Kafka) handles the 250 B daily events, applying YARA rules in parallel.
    • The ML inference service runs on a scalable GPU/CPU cluster, scoring the reduced event set in near‑real time.
    • A ticketing integration pushes only high‑confidence alerts to the SOC.
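
To make steps 1–2 concrete, here is a minimal Python sketch of the two‑stage flow, using yara-python and scikit‑learn as stand-ins for whatever tooling the authors actually use; the rule text, event fields, features, and 0.9 ticket threshold are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the two-stage pipeline: a deliberately permissive YARA
# pre-filter followed by an ML classifier. The rule text, event fields,
# features, and threshold below are illustrative, not taken from the paper.
import numpy as np
import yara                                    # pip install yara-python
from sklearn.ensemble import GradientBoostingClassifier

# Stage 1: a loose rule that favors recall over precision.
LOOSE_RULE = r"""
rule suspicious_powershell_download {
    strings:
        $ps  = "powershell" nocase
        $url = "http" nocase
    condition:
        $ps and $url
}
"""
rules = yara.compile(source=LOOSE_RULE)

def yara_filter(raw_events):
    """Keep only events whose payload matches at least one loose rule."""
    return [e for e in raw_events if rules.match(data=e["payload"])]

# Stage 2: a gradient-boosted classifier trained on labeled (e.g. synthetic) events.
def featurize(event):
    # Hypothetical hand-crafted features; the real system would use richer ones.
    p = event["payload"].lower()
    return [len(p), p.count("http"), p.count("encodedcommand")]

def train_classifier(labeled_events):
    X = np.array([featurize(e) for e in labeled_events])
    y = np.array([e["label"] for e in labeled_events])       # 1 = malicious
    return GradientBoostingClassifier().fit(X, y)

def score_and_ticket(clf, filtered_events, threshold=0.9):
    """Turn only high-confidence detections into SOC tickets."""
    tickets = []
    for e in filtered_events:
        p_malicious = clf.predict_proba([featurize(e)])[0][1]
        if p_malicious >= threshold:
            tickets.append({**e, "score": float(p_malicious)})
    return tickets
```

The loose rule deliberately over-matches to preserve recall; it is the classifier's probability threshold that keeps the daily ticket count down to a handful.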
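
The feedback loop in step 3 can be pictured as a scheduled retraining job that folds analyst verdicts back into the training set. A hedged sketch, with the retraining cadence, field names, and train_fn hook assumed rather than taken from the paper:

```python
# Sketch of the active-learning loop: analyst verdicts on tickets become new
# labels and are periodically folded back into the training set. The daily
# cadence, field names, and train_fn hook are illustrative assumptions.
import time

class FeedbackLoop:
    def __init__(self, train_fn, initial_training_set, retrain_every_s=24 * 3600):
        self.train_fn = train_fn                        # e.g. train_classifier above
        self.training_set = list(initial_training_set)  # Simula-style synthetic seed data
        self.retrain_every_s = retrain_every_s
        self.clf = train_fn(self.training_set)
        self._last_retrain = time.time()

    def record_verdict(self, ticket_event, is_malicious):
        """Called when an analyst closes a ticket; the decision becomes a label."""
        self.training_set.append({**ticket_event, "label": 1 if is_malicious else 0})

    def maybe_retrain(self):
        """Retrain on a fixed schedule so the model keeps tracking analyst feedback."""
        if time.time() - self._last_retrain >= self.retrain_every_s:
            self.clf = self.train_fn(self.training_set)
            self._last_retrain = time.time()
        return self.clf
```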
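
And the deployment in step 4 might be wired up roughly as below. This is shown with kafka-python purely for illustration (the summary only mentions stream processing at Flink/Kafka scale); it reuses the hypothetical rules, featurize, and classifier objects from the first sketch via parameters, and the topic names, broker address, and threshold are assumptions.

```python
# Sketch of the streaming deployment: consume raw events from a queue, apply
# the loose YARA filter, score survivors with the classifier, and forward only
# high-confidence detections to a ticket topic. Shown with kafka-python purely
# for illustration; topic names, broker address, and threshold are assumptions.
import json
from kafka import KafkaConsumer, KafkaProducer   # pip install kafka-python

def run_detection_stream(yara_rules, featurize, clf,
                         brokers="localhost:9092", threshold=0.9):
    consumer = KafkaConsumer("raw-endpoint-events", bootstrap_servers=brokers)
    producer = KafkaProducer(bootstrap_servers=brokers,
                             value_serializer=lambda v: json.dumps(v).encode())
    for msg in consumer:
        event = json.loads(msg.value)
        if not yara_rules.match(data=event["payload"]):        # Stage 1: loose filter
            continue
        score = clf.predict_proba([featurize(event)])[0][1]    # Stage 2: ML scoring
        if score >= threshold:                                  # ticket-worthy
            producer.send("soc-tickets", {**event, "score": float(score)})
```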

Results & Findings

| Metric | Before Hybrid System | After Hybrid System |
| --- | --- | --- |
| Raw events per day | ~250 B | ~250 B (filtered) |
| Events after YARA stage | – | ~5 M |
| Events after ML stage (tickets) | – | ≈ 10–15 |
| Precision (TP / (TP + FP)) | 2 % (rule‑only) | ≈ 85 % (after 3 months) |
| Recall (TP / (TP + FN)) | 95 % (rule‑only) | ≈ 92 % |
| Analyst time per day | 8 h | ≈ 30 min |
  • Precision improves over time: The active‑learning loop raised precision from ~70 % in week 1 to >85 % after three months.
  • False‑positive reduction: The ML stage eliminated >99.9 % of the YARA‑generated noise.
  • Scalability: The pipeline sustained the full 250 B‑event load with sub‑second latency per event batch.
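
For reference, the precision and recall columns follow the standard definitions given in the table; a tiny worked example (with invented daily counts, not the paper's data) shows the arithmetic:

```python
# Standard definitions behind the table's precision and recall columns.
# The counts below are invented purely to illustrate the arithmetic.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Example day: 12 true positives, 2 false positives, 1 missed attack.
print(precision(12, 2))   # ~0.857, in line with the ~85 % after three months
print(recall(12, 1))      # ~0.923, in line with the ~92 % reported
```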

Practical Implications

  • Cost‑effective SOC scaling: Enterprises can slash analyst workload without hiring additional data‑science staff.
  • Rapid onboarding: Security teams can start with simple YARA signatures; the system handles the heavy lifting of model training.
  • Adaptability to new threats: As attackers tweak tactics, analyst verdicts quickly feed back into retraining, keeping detection up‑to‑date without manual rule rewrites.
  • Vendor‑agnostic integration: The framework works with existing SIEMs, log pipelines, and ticketing tools, making it a drop‑in upgrade for legacy environments.
  • Compliance & auditability: Synthetic data generation is fully reproducible, providing traceable training artifacts for regulatory reviews.

Limitations & Future Work

  • Synthetic data realism: While Simula produces high‑quality samples, edge‑case attacks that deviate sharply from generated patterns may still slip through.
  • Model drift detection: The current system relies on analyst feedback; automated drift alerts could further reduce latency in model updates.
  • Explainability: The ML classifier is a black‑box for many SOC analysts; integrating interpretable models or post‑hoc explanations would boost trust.
  • Cross‑domain generalization: Experiments were limited to Windows‑based endpoint logs; extending to cloud‑native workloads and network telemetry is an open avenue.

Bottom line: By marrying permissive YARA rules with a self‑sustaining ML engine powered by synthetic data and active learning, the authors demonstrate a pragmatic path to democratize advanced threat detection across enterprises of any size.

Authors

  • Sadegh Momeni
  • Ge Zhang
  • Birkett Huber
  • Hamza Harkous
  • Sam Lipton
  • Benoit Seguin
  • Yanis Pavlidis

Paper Information

  • arXiv ID: 2512.08802v1
  • Categories: cs.CR, cs.AI
  • Published: December 9, 2025
  • PDF: Download PDF