[Paper] Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning

Published: May 5, 2026 at 01:21 PM EDT

Source: arXiv - 2605.04000v1

Overview

Rust’s promise of “memory‑safe by design” is only as strong as the tools that verify it. This paper tackles a painful reality: static memory‑safety analyzers for Rust (e.g., Rudra, MirChecker) drown developers in false positives, eroding trust and wasting time. By marrying reinforcement learning (RL) with lightweight dynamic fuzzing, the authors devise a system that automatically learns to silence spurious warnings while still catching real bugs.

Key Contributions

  • RL‑driven warning suppression: A reinforcement‑learning agent that learns a policy for classifying static‑analysis warnings as true or false based on features extracted from Rust’s MIR.
  • Hybrid static‑dynamic feedback loop: Integration of cargo‑fuzz as an auxiliary validator, allowing the agent to confirm or refute borderline warnings through targeted fuzz testing.
  • Empirical superiority over LLM baselines: Achieves 65.2 % accuracy and an F1 of 0.659, with accuracy 17.1 % higher than the best large‑language‑model (LLM) classifier evaluated on the same dataset.
  • Substantial precision boost: Raises precision from 25.6 % (raw Rudra) to 59.0 % while maintaining a respectable recall of 74.6 %.
  • Open‑source prototype: The implementation is released as a Rust crate, enabling immediate experimentation by the community.

Methodology

  1. Feature extraction from MIR: The tool parses the Mid‑level Intermediate Representation (MIR) generated by the Rust compiler and derives contextual cues (e.g., lifetimes, borrow patterns, control‑flow metrics) for each warning.
  2. Reinforcement learning setup:
    • State: Vector of MIR‑derived features plus the raw warning type.
    • Action: “Keep” (treat as a real bug) or “Suppress” (mark as false positive).
    • Reward: +1 for correctly classifying a true bug, –1 for suppressing a real bug, and a small penalty for unnecessary manual review.
    • Policy learning: A lightweight deep Q‑network (DQN) is trained on a curated corpus of Rust projects with ground‑truth labels (derived from prior bug reports and manual inspection).
  3. Dynamic validation via cargo‑fuzz: When the agent’s confidence falls below a threshold, the system automatically generates a focused fuzz harness targeting the suspicious code region. Successful crashes confirm a true bug; no crash reinforces suppression.
  4. Iterative refinement: The RL agent updates its policy after each fuzzing round, gradually improving its discrimination capability.
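
The state, action, and reward definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a linear Q-function stands in for the paper's DQN, each warning is treated as a one-step episode, the penalty value of -0.1 and the zero reward for a correct suppression are guesses (the summary only says "small penalty"), and the two toy features are made up.

```python
import random

KEEP, SUPPRESS = 0, 1
REVIEW_PENALTY = -0.1  # assumed magnitude; the summary only says "a small penalty"


def reward(action, is_true_bug):
    """Reward scheme paraphrased from the summary; only +1 and -1 come from the text."""
    if is_true_bug:
        return 1.0 if action == KEEP else -1.0
    # False positive: keeping it costs reviewer time; suppressing it is assumed neutral.
    return REVIEW_PENALTY if action == KEEP else 0.0


class LinearQ:
    """Linear Q-function over the feature vector (a stand-in for the paper's DQN)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [[0.0] * n_features for _ in (KEEP, SUPPRESS)]
        self.lr = lr

    def q(self, state, action):
        return sum(wi * xi for wi, xi in zip(self.w[action], state))

    def act(self, state, epsilon=0.0, rng=random):
        # Epsilon-greedy: explore with probability epsilon, otherwise pick the best action.
        if rng.random() < epsilon:
            return rng.choice((KEEP, SUPPRESS))
        return KEEP if self.q(state, KEEP) >= self.q(state, SUPPRESS) else SUPPRESS

    def update(self, state, action, r):
        # One-step episode, so the target is the immediate reward (no bootstrapping).
        err = r - self.q(state, action)
        self.w[action] = [wi + self.lr * err * xi
                          for wi, xi in zip(self.w[action], state)]


def train(samples, epochs=200, seed=0):
    rng = random.Random(seed)
    agent = LinearQ(n_features=len(samples[0][0]))
    for _ in range(epochs):
        for state, is_true_bug in samples:
            action = agent.act(state, epsilon=0.2, rng=rng)
            agent.update(state, action, reward(action, is_true_bug))
    return agent


# Toy data with two hypothetical features: (raw deref in unsafe code, bounds checked).
toy = [((1.0, 0.0), True), ((0.0, 1.0), False)] * 4
agent = train(toy)
print(agent.act((1.0, 0.0)) == KEEP, agent.act((0.0, 1.0)) == SUPPRESS)
```

On this toy data the learned policy keeps the first pattern and suppresses the second. Real MIR features would be noisier, which is where the fuzzing step in the pipeline comes in.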

Results & Findings

Metric | Raw Rudra | RL‑only | RL + Fuzz (proposed)
--- | --- | --- | ---
Accuracy | 48.1 % | 57.5 % | 65.2 %
Precision | 25.6 % | 51.4 % | 59.0 %
Recall | 81.2 % | 70.1 % | 74.6 %
F1 Score | 0.38 | 0.55 | 0.659

  • The RL‑only model already halves the false‑positive rate compared with raw static analysis, with precision rising from 25.6 % to 51.4 %.
  • Adding targeted fuzzing lifts accuracy from 57.5 % to 65.2 % and F1 from 0.55 to 0.659, confirming that dynamic feedback is a powerful correctness signal.
  • Compared with the strongest LLM‑based classifier (prompt‑engineered GPT‑4), the hybrid approach improves accuracy by 17.1 % and F1 by 0.12 points.
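
As a quick check on how the headline F1 relates to the other reported metrics, the sketch below recomputes F1 as the harmonic mean of precision and recall for the RL + Fuzz configuration. The function name is illustrative; the inputs are the precision and recall figures quoted in this summary.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision (59.0 %) and recall (74.6 %) reported for the RL + Fuzz configuration.
proposed_f1 = f1_score(0.590, 0.746)
print(round(proposed_f1, 3))  # prints 0.659
```

The result matches the reported F1 of 0.659 for the proposed system.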

Practical Implications

  • Developer productivity: Teams can integrate the tool into CI pipelines; the system automatically filters out noisy warnings, letting engineers focus on genuine memory‑safety defects.
  • Higher adoption of Rust in safety‑critical domains: Reduced false positives make static verification more palatable for aerospace, automotive, and medical software where certification hinges on trustworthy analysis results.
  • Cost‑effective security testing: By automatically generating focused fuzz harnesses only for ambiguous warnings, the approach avoids the expense of full‑scale fuzzing while still gaining dynamic validation where it matters most.
  • Extensibility to other languages: The RL‑plus‑dynamic‑feedback pattern can be transplanted to static analyses for C/C++, Go, or even higher‑level security linters, offering a roadmap for broader false‑positive mitigation.
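
The cost argument above rests on sending only ambiguous warnings to the fuzzer. A minimal sketch of that routing follows, assuming confidence is measured as the gap between the agent's two Q-values and using a hypothetical threshold of 0.3. Neither the confidence measure nor the threshold value is given in the summary, and `AnalyzerWarning` is an illustrative type, not part of the released crate.

```python
from dataclasses import dataclass


@dataclass
class AnalyzerWarning:
    ident: str
    q_keep: float      # agent's value estimate for keeping the warning
    q_suppress: float  # agent's value estimate for suppressing it


def route(w, threshold=0.3):
    """Decide what to do with a warning; the threshold value is an assumption."""
    confidence = abs(w.q_keep - w.q_suppress)
    if confidence < threshold:
        return "fuzz"  # ambiguous: hand over to dynamic validation
    return "keep" if w.q_keep > w.q_suppress else "suppress"


def resolve_after_fuzz(crashed):
    """A crash confirms a true bug; no crash reinforces suppression."""
    return "keep" if crashed else "suppress"


warnings = [
    AnalyzerWarning("w1", q_keep=0.9, q_suppress=-0.8),
    AnalyzerWarning("w2", q_keep=0.05, q_suppress=0.10),
    AnalyzerWarning("w3", q_keep=-0.4, q_suppress=0.2),
]
plan = {w.ident: route(w) for w in warnings}
print(plan)  # {'w1': 'keep', 'w2': 'fuzz', 'w3': 'suppress'}
```

Only `w2` would trigger a fuzz run here, so the fuzzing budget scales with the number of ambiguous warnings and not with the size of the codebase.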

Limitations & Future Work

  • Training data dependence: The RL agent’s effectiveness hinges on a labeled dataset of Rust warnings; projects with scarce historical bug data may see slower convergence.
  • Fuzzing overhead: Although targeted, the dynamic validation step adds latency to CI runs; scaling to very large codebases may require parallel fuzzing infrastructure.
  • Generalization to exotic MIR constructs: The current feature set may miss edge‑case patterns (e.g., unsafe blocks with complex lifetimes), potentially leading to missed bugs.
  • Future directions: The authors plan to explore meta‑learning to transfer policies across projects, incorporate richer static‑analysis signals (e.g., type‑state models), and evaluate the approach on real‑world industrial Rust codebases under strict certification regimes.

Authors

  • P Akilesh
  • Leuson Da Silva
  • Foutse Khomh
  • Sridhar Chimalakonda

Paper Information

  • arXiv ID: 2605.04000v1
  • Categories: cs.SE
  • Published: May 5, 2026