[Paper] RandSet: Randomized Corpus Reduction for Fuzzing Seed Scheduling

Published: February 26, 2026 at 03:11 AM EST

Source: arXiv - 2602.22729v1

Overview

The paper introduces RandSet, a lightweight, randomized technique for shrinking a fuzzer’s seed corpus while preserving (and even improving) the diversity of inputs it feeds to the target program. By treating corpus reduction as a set‑cover problem and injecting randomness into the solution, RandSet cuts the corpus down to just a few percent of its original size, dramatically easing the “seed explosion” bottleneck that hampers modern fuzzers.

Key Contributions

  • Randomized set‑cover formulation – Cast corpus reduction as a set‑cover problem and solve it with a fast, probabilistic algorithm that yields a small yet diverse subset of seeds.
  • Low‑overhead integration – Implemented RandSet on three mainstream fuzzers (AFL++, LibAFL, and Centipede) with only 1–4 % runtime overhead, making it practical for high‑frequency seed scheduling.
  • Empirical validation – Demonstrated on standalone programs, the FuzzBench benchmark, and the Magma bug‑finding suite that RandSet reduces corpus size to ≈4–6 % of the original while delivering +16 % coverage on average and uncovering up to 7 more real bugs than the prior state‑of‑the‑art.
  • Diversity‑first design – Shows that randomness in the reduction step naturally yields a more varied seed selection than deterministic reducers such as cull_queue, AFL‑cmin, or MinSet.

Methodology

  1. Feature extraction – Each seed is represented by the set of features (e.g., code branches, input tokens) it triggers when executed.
  2. Set‑cover problem – The goal is to pick a minimal subset of seeds whose combined feature set covers all features observed in the full corpus.
  3. Randomized greedy algorithm – Instead of solving the NP‑hard set‑cover exactly, RandSet repeatedly picks a seed at random from the pool of “most‑valuable” candidates (those covering the largest number of uncovered features). This randomness:
    • Typically yields a different reduced subset on each run, preserving diversity.
    • Runs in O(|C| · log |F|) time (|C| = corpus size, |F| = feature count), which is negligible compared to the cost of executing the target program.
  4. Scheduling – The fuzzer’s normal seed‑selection loop is redirected to the reduced subset. Because the subset is tiny, the scheduler can afford richer heuristics (e.g., priority queues, energy allocation) without incurring the original explosion cost.
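The randomized greedy step above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: `randset_reduce`, the seed-id/feature-set representation, and the tie-breaking pool (here, exact maxima) are simplifications for exposition, not the paper's implementation, and this naive loop does not achieve the stated O(|C| · log |F|) bound.

```python
import random

def randset_reduce(corpus, rng=None):
    """Randomized greedy set cover over a seed corpus.

    corpus: dict mapping seed id -> set of features (e.g., branch ids)
    Returns a small list of seed ids whose combined features cover the
    union of all features observed in the full corpus.
    """
    rng = rng or random.Random()
    uncovered = set().union(*corpus.values()) if corpus else set()
    remaining = dict(corpus)
    selected = []
    while uncovered:
        # Score every remaining seed by how many uncovered features it adds.
        gains = {s: len(f & uncovered) for s, f in remaining.items()}
        best = max(gains.values())
        # Pick at random among the most valuable candidates instead of
        # deterministically taking the first maximum -- this randomness
        # is what varies the reduced subset from run to run.
        candidates = [s for s, g in gains.items() if g == best]
        pick = rng.choice(candidates)
        selected.append(pick)
        uncovered -= remaining.pop(pick)
    return selected
```

A seeded `random.Random` makes a run reproducible; an unseeded one gives a fresh subset each time.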

Results & Findings

| Benchmark | Avg. Subset Ratio | Coverage Gain | Bug-finding Gain | Overhead |
|---|---|---|---|---|
| Standalone programs | 4.03 % | +16.58 % | | 1.17 % |
| FuzzBench (AFL++) | 5.99 % | +3.57 % | | 2.04 % |
| Magma (bug suite) | | | +7 bugs (vs. best prior) | 3.93 % |
  • Diversity: RandSet’s reduced sets contain a broader spread of unique features than deterministic reducers, which often collapse many seeds into a few “representatives.”
  • Speed: The random greedy step finishes in milliseconds even for corpora of tens of thousands of seeds, enabling per‑iteration reduction (i.e., the fuzzer can re‑run RandSet frequently).
  • Compatibility: No changes to the underlying fuzzing engine were required; RandSet plugs into the existing seed‑queue management API.

Practical Implications

  • Scalable fuzzing pipelines – Teams can keep fuzzers running for weeks without the memory and CPU pressure caused by ever‑growing corpora.
  • Better resource allocation – With a tiny, high‑quality seed pool, developers can afford more aggressive mutation strategies, longer execution time per seed, or parallel fuzzing across more cores.
  • Improved CI integration – Continuous‑integration fuzzing jobs often hit time limits; RandSet’s low overhead means more coverage per CI minute and earlier bug discovery.
  • Cross‑fuzzer adoption – Since RandSet works with AFL++, LibAFL, and Centipede, any project already using these tools can adopt it with a single library import and a few configuration tweaks.

Limitations & Future Work

  • Feature definition dependency – RandSet’s effectiveness hinges on the quality of the feature extraction (branch coverage, tokenization, etc.). Poor or overly coarse features could lead to sub‑optimal subsets.
  • Randomness variance – While randomness boosts diversity, it also introduces nondeterminism in reproducibility; the authors suggest fixing a seed for deterministic runs when needed.
  • Dynamic program behavior – For programs whose behavior changes dramatically over time (e.g., stateful servers), a static feature set may become stale; future work could explore adaptive feature updates.
  • Extending beyond coverage – The current formulation focuses on coverage features; extending RandSet to incorporate other metrics (e.g., crash‑triggering potential, input size diversity) is an open research direction.
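As a concrete illustration of the reproducibility workaround mentioned above, seeding the random number generator makes the randomized tie-break deterministic. The function and names below are a generic sketch, not RandSet's actual configuration knob:

```python
import random

def pick_among_best(candidates, seed=None):
    """Randomized tie-break among equally valuable seeds.

    Passing a fixed `seed` makes the choice reproducible; leaving it
    None preserves run-to-run diversity. (Illustrative only -- the
    real knob in a RandSet deployment may differ.)
    """
    rng = random.Random(seed)
    # Sort first so the seeded choice is independent of input order.
    return rng.choice(sorted(candidates))

ties = ["seed_07", "seed_13", "seed_42"]
assert pick_among_best(ties, seed=1234) == pick_among_best(ties, seed=1234)
```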

Authors

  • Yuchong Xie
  • Kaikai Zhang
  • Yu Liu
  • Rundong Yang
  • Ping Chen
  • Shuai Wang
  • Dongdong She

Paper Information

  • arXiv ID: 2602.22729v1
  • Categories: cs.SE, cs.CR
  • Published: February 26, 2026