[Paper] Democratizing ML for Enterprise Security: A Self-Sustained Attack Detection Framework

Published: December 9, 2025 at 11:58 AM EST
3 min read
Source: arXiv - 2512.08802v1

Overview

The paper presents a two‑stage, hybrid threat‑detection framework that blends loose YARA rules with a machine‑learning (ML) classifier, aiming to make advanced ML‑driven security affordable for any enterprise. By automatically generating synthetic training data and continuously learning from analyst feedback, the system keeps detection rules fresh while dramatically cutting false alarms.

Key Contributions

  • Hybrid detection pipeline: Coarse‑grained YARA filtering followed by a fine‑grained ML classifier to balance recall and precision.
  • Synthetic data generation with Simula: Enables analysts to create high‑quality training sets without needing large, labeled security datasets.
  • Active‑learning feedback loop: Real‑time analyst verdicts are fed back to the model, preventing rule decay and continuously improving precision.
  • Production‑scale validation: Deployed on tens of thousands of endpoints, processing up to 250 B raw events per day and delivering only a few tickets daily.
  • Low‑maintenance design: Minimal data‑science expertise required; security teams act as “teachers” rather than model developers.

Methodology

  1. Stage 1 – Loose YARA Rules

    • Analysts write permissive YARA signatures that aim for high recall (catch as many potential threats as possible).
    • These rules act as a fast, lightweight filter on massive log streams, reducing the data volume dramatically.
  2. Stage 2 – ML Classifier

    • The filtered events become inputs to a supervised classifier (e.g., gradient‑boosted trees).
    • Training data are produced by Simula, a synthetic data generator that mimics realistic attack patterns based on analyst‑provided “seed” behaviors.
    • The classifier learns to distinguish true threats from the noisy output of the YARA stage (the first sketch after this list illustrates the two‑stage flow).
  3. Active Learning Loop

    • When analysts investigate a ticket, their decision (malicious / benign) is automatically logged.
    • These labels are fed back to retrain the classifier on a regular schedule, allowing the model to adapt to emerging tactics and to correct drift in the YARA rules (see the retraining sketch after this list).
  4. Deployment Architecture

    • Stream processing (e.g., Apache Flink/Kafka) handles the 250 B daily events, applying YARA rules in parallel.
    • The ML inference service runs on a scalable GPU/CPU cluster, scoring the reduced event set in near‑real time.
    • A ticketing integration pushes only high‑confidence alerts to the SOC.
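
To make steps 1–2 concrete, here is a minimal Python sketch of the two‑stage flow, using yara-python and scikit‑learn as stand-ins for whatever tooling the authors actually use; the rule text, event fields, features, and 0.9 ticket threshold are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the two-stage pipeline: a deliberately permissive YARA
# pre-filter followed by an ML classifier. The rule text, event fields,
# features, and threshold below are illustrative, not taken from the paper.
import numpy as np
import yara                                    # pip install yara-python
from sklearn.ensemble import GradientBoostingClassifier

# Stage 1: a loose rule that favors recall over precision.
LOOSE_RULE = r"""
rule suspicious_powershell_download {
    strings:
        $ps  = "powershell" nocase
        $url = "http" nocase
    condition:
        $ps and $url
}
"""
rules = yara.compile(source=LOOSE_RULE)

def yara_filter(raw_events):
    """Keep only events whose payload matches at least one loose rule."""
    return [e for e in raw_events if rules.match(data=e["payload"])]

# Stage 2: a gradient-boosted classifier trained on labeled (e.g. synthetic) events.
def featurize(event):
    # Hypothetical hand-crafted features; the real system would use richer ones.
    p = event["payload"].lower()
    return [len(p), p.count("http"), p.count("encodedcommand")]

def train_classifier(labeled_events):
    X = np.array([featurize(e) for e in labeled_events])
    y = np.array([e["label"] for e in labeled_events])       # 1 = malicious
    return GradientBoostingClassifier().fit(X, y)

def score_and_ticket(clf, filtered_events, threshold=0.9):
    """Turn only high-confidence detections into SOC tickets."""
    tickets = []
    for e in filtered_events:
        p_malicious = clf.predict_proba([featurize(e)])[0][1]
        if p_malicious >= threshold:
            tickets.append({**e, "score": float(p_malicious)})
    return tickets
```

The loose rule deliberately over-matches to preserve recall; it is the classifier's probability threshold that keeps the daily ticket count down to a handful.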
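
The feedback loop in step 3 can be pictured as a scheduled retraining job that folds analyst verdicts back into the training set. A hedged sketch, with the retraining cadence, field names, and train_fn hook assumed rather than taken from the paper:

```python
# Sketch of the active-learning loop: analyst verdicts on tickets become new
# labels and are periodically folded back into the training set. The daily
# cadence, field names, and train_fn hook are illustrative assumptions.
import time

class FeedbackLoop:
    def __init__(self, train_fn, initial_training_set, retrain_every_s=24 * 3600):
        self.train_fn = train_fn                        # e.g. train_classifier above
        self.training_set = list(initial_training_set)  # Simula-style synthetic seed data
        self.retrain_every_s = retrain_every_s
        self.clf = train_fn(self.training_set)
        self._last_retrain = time.time()

    def record_verdict(self, ticket_event, is_malicious):
        """Called when an analyst closes a ticket; the decision becomes a label."""
        self.training_set.append({**ticket_event, "label": 1 if is_malicious else 0})

    def maybe_retrain(self):
        """Retrain on a fixed schedule so the model keeps tracking analyst feedback."""
        if time.time() - self._last_retrain >= self.retrain_every_s:
            self.clf = self.train_fn(self.training_set)
            self._last_retrain = time.time()
        return self.clf
```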
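
And the deployment in step 4 might be wired up roughly as below. This is shown with kafka-python purely for illustration (the summary only mentions stream processing at Flink/Kafka scale); it reuses the hypothetical rules, featurize, and classifier objects from the first sketch via parameters, and the topic names, broker address, and threshold are assumptions.

```python
# Sketch of the streaming deployment: consume raw events from a queue, apply
# the loose YARA filter, score survivors with the classifier, and forward only
# high-confidence detections to a ticket topic. Shown with kafka-python purely
# for illustration; topic names, broker address, and threshold are assumptions.
import json
from kafka import KafkaConsumer, KafkaProducer   # pip install kafka-python

def run_detection_stream(yara_rules, featurize, clf,
                         brokers="localhost:9092", threshold=0.9):
    consumer = KafkaConsumer("raw-endpoint-events", bootstrap_servers=brokers)
    producer = KafkaProducer(bootstrap_servers=brokers,
                             value_serializer=lambda v: json.dumps(v).encode())
    for msg in consumer:
        event = json.loads(msg.value)
        if not yara_rules.match(data=event["payload"]):        # Stage 1: loose filter
            continue
        score = clf.predict_proba([featurize(event)])[0][1]    # Stage 2: ML scoring
        if score >= threshold:                                  # ticket-worthy
            producer.send("soc-tickets", {**event, "score": float(score)})
```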

Results & Findings

| Metric | Before Hybrid System | After Hybrid System |
| --- | --- | --- |
| Raw events per day | ~250 B | ~250 B (filtered) |
| Events after YARA stage | – | ~5 M |
| Events after ML stage (tickets) | – | ≈ 10–15 |
| Precision (TP / (TP + FP)) | 2 % (rule‑only) | ≈ 85 % (after 3 months) |
| Recall (TP / (TP + FN)) | 95 % (rule‑only) | ≈ 92 % |
| Analyst time per day | 8 h | ≈ 30 min |
  • Precision improves over time: The active‑learning loop raised precision from ~70 % in week 1 to >85 % after three months.
  • False‑positive reduction: The ML stage eliminated >99.9 % of the YARA‑generated noise.
  • Scalability: The pipeline sustained the full 250 B‑event load with sub‑second latency per event batch.
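
For reference, the precision and recall columns follow the standard definitions given in the table; a tiny worked example (with invented daily counts, not the paper's data) shows the arithmetic:

```python
# Standard definitions behind the table's precision and recall columns.
# The counts below are invented purely to illustrate the arithmetic.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Example day: 12 true positives, 2 false positives, 1 missed attack.
print(precision(12, 2))   # ~0.857, in line with the ~85 % after three months
print(recall(12, 1))      # ~0.923, in line with the ~92 % reported
```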

Practical Implications

  • Cost‑effective SOC scaling: Enterprises can slash analyst workload without hiring additional data‑science staff.
  • Rapid onboarding: Security teams can start with simple YARA signatures; the system handles the heavy lifting of model training.
  • Adaptability to new threats: As attackers tweak tactics, analyst verdicts quickly feed back into retraining, keeping detection up‑to‑date without manual rule rewrites.
  • Vendor‑agnostic integration: The framework works with existing SIEMs, log pipelines, and ticketing tools, making it a drop‑in upgrade for legacy environments.
  • Compliance & auditability: Synthetic data generation is fully reproducible, providing traceable training artifacts for regulatory reviews.

Limitations & Future Work

  • Synthetic data realism: While Simula produces high‑quality samples, edge‑case attacks that deviate sharply from generated patterns may still slip through.
  • Model drift detection: The current system relies on analyst feedback; automated drift alerts could further reduce latency in model updates.
  • Explainability: The ML classifier is a black‑box for many SOC analysts; integrating interpretable models or post‑hoc explanations would boost trust.
  • Cross‑domain generalization: Experiments were limited to Windows‑based endpoint logs; extending to cloud‑native workloads and network telemetry is an open avenue.

Bottom line: By marrying permissive YARA rules with a self‑sustaining ML engine powered by synthetic data and active learning, the authors demonstrate a pragmatic path to democratize advanced threat detection across enterprises of any size.

Authors

  • Sadegh Momeni
  • Ge Zhang
  • Birkett Huber
  • Hamza Harkous
  • Sam Lipton
  • Benoit Seguin
  • Yanis Pavlidis

Paper Information

  • arXiv ID: 2512.08802v1
  • Categories: cs.CR, cs.AI
  • Published: December 9, 2025
  • PDF: Download PDF