[Paper] Optimization-Aware Test Generation for Deep Learning Compilers

Published: November 24, 2025 at 04:27 AM EST
4 min read
Source: arXiv - 2511.18918v1

Overview

Deep learning compilers such as TVM and ONNX Runtime translate high‑level neural‑network models into hardware‑specific code, applying aggressive optimizations to squeeze out performance. Because these compilers sit at the heart of the DL deployment pipeline, bugs in their optimization passes can silently degrade accuracy, cause crashes, or open security holes. The paper introduces OATest, a technique that automatically generates optimization‑aware test programs (computational graphs) to stress‑test DL compilers more thoroughly than prior methods.

Key Contributions

  • Optimization‑aware graph synthesis – extracts real‑world optimization patterns from existing compiler test suites and injects them into seed graphs, ensuring the generated tests actually trigger compiler optimizations.
  • Edge‑reusing strategy – tightly couples extracted patterns with surrounding graph context, preserving data‑flow relationships that are crucial for realistic optimization paths.
  • Auxiliary‑layer insertion – automatically repairs broken tensor shape or type constraints in the synthesized graphs, guaranteeing that the generated programs are semantically valid.
  • Dual‑oracle differential testing – leverages both (i) output‑equivalence checking against a reference interpreter and (ii) internal compiler state comparison, enabling detection of subtle bugs in TVM and ONNX Runtime.
  • Empirical impact – OATest discovers 58 new bugs (36 already confirmed/fixed) and achieves higher statement and branch coverage than the previous state‑of‑the‑art fuzzing tool.
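To make the dual-oracle idea concrete, here is a minimal differential-testing sketch. It uses pure-Python/NumPy stand-ins for the two execution paths; the names `run_reference`, `run_optimized`, and `differential_oracle` are illustrative, not OATest's actual API:

```python
import numpy as np

def run_reference(x):
    # Unoptimized "interpreter" path: computes relu(x * 2 + 1) step by step.
    y = x * 2.0
    y = y + 1.0
    return np.maximum(y, 0.0)

def run_optimized(x):
    # Stand-in for a compiled path where the ops were fused by an
    # optimization pass. A bug here (e.g., the fusion dropping the +1)
    # would produce a mismatch that the oracle flags.
    return np.maximum(x * 2.0 + 1.0, 0.0)

def differential_oracle(x, rtol=1e-5, atol=1e-6):
    # Flag a divergence if the two paths disagree beyond tolerance.
    return np.allclose(run_reference(x), run_optimized(x),
                       rtol=rtol, atol=atol)

x = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
print(differential_oracle(x))
```

The tolerances matter in practice: optimizations such as fused multiply-add legitimately change floating-point rounding, so an overly strict oracle would report false positives.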

Methodology

  1. Pattern Mining – The authors parse the official test suites of TVM and ONNX Runtime, identifying recurring sub‑graphs that are explicitly targeted by optimization passes (e.g., operator fusion, constant folding).
  2. Seed Graph Selection – A pool of simple, well‑typed computational graphs is collected from open‑source model repositories (e.g., ONNX model zoo).
  3. Graph Augmentation
    • Edge Reuse: For each mined pattern, OATest re‑uses existing edges of the seed graph to embed the pattern, preserving realistic data dependencies.
    • Auxiliary Layers: When the insertion breaks shape or type constraints, OATest automatically adds “helper” layers (e.g., reshape, cast) to restore validity.
  4. Test Oracle Construction
    • Differential Oracle: Runs the same graph through two compilers (or a compiler vs. a reference interpreter) and flags mismatches in numerical output or execution crashes.
    • Coverage Oracle: Instruments the compiler source to collect statement/branch coverage, guiding the fuzzer toward unexplored optimization code.
  5. Search Loop – A guided fuzzing loop repeatedly mutates graphs, monitors coverage feedback, and records any divergence reported by the oracles.
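The auxiliary-layer repair in step 3 can be sketched as follows. This is a simplified toy representation (graphs as lists of op dicts with explicit shapes; the function name `insert_with_repair` is hypothetical), showing the core idea: when an embedded pattern's expected input shape disagrees with what the seed graph produces, a helper `reshape` layer is inserted to restore validity:

```python
import numpy as np

def insert_with_repair(graph, pattern_op):
    """Append pattern_op to graph; if the output shape of the graph's
    last op doesn't match what pattern_op expects, insert a 'reshape'
    helper layer first (the auxiliary-layer idea)."""
    produced = graph[-1]["out_shape"]
    expected = pattern_op["in_shape"]
    if produced != expected:
        # A pure reshape can only repair the mismatch if the element
        # counts agree; otherwise a different helper (cast, pad, ...)
        # would be needed.
        assert np.prod(produced) == np.prod(expected)
        graph.append({"op": "reshape", "out_shape": expected})
    graph.append(pattern_op)
    return graph

# Seed graph ends in a conv producing (1, 8, 4, 4) = 128 elements;
# the mined pattern expects (1, 8, 16) = 128 elements, so a reshape fixes it.
seed = [{"op": "conv2d", "out_shape": (1, 8, 4, 4)}]
pattern = {"op": "matmul", "in_shape": (1, 8, 16), "out_shape": (1, 8, 16)}
g = insert_with_repair(seed, pattern)
print([n["op"] for n in g])  # ['conv2d', 'reshape', 'matmul']
```

A real implementation would operate on ONNX or Relay graph IR and run shape inference rather than carrying shapes by hand, but the repair logic follows the same pattern.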

Results & Findings

| Metric | OATest | Prior state-of-the-art (e.g., DL-Fuzz) |
| --- | --- | --- |
| Detected bugs (TVM) | 42 | 21 |
| Detected bugs (ONNX Runtime) | 16 | 9 |
| Statement coverage | +12 % over baseline | baseline |
| Branch coverage | +15 % over baseline | baseline |
| Valid test generation rate | 98 % (after auxiliary-layer fix) | 84 % |

  • Bug Types: Most of the discovered bugs reside in optimization passes such as operator fusion, layout transformation, and memory planning. Several caused silent numerical drift; others led to crashes or illegal memory accesses.
  • Developer Impact: 36 of the 58 newly found bugs were acknowledged and patched by the TVM and ONNX Runtime teams within weeks of disclosure.

Practical Implications

  • More Robust Deployments: By integrating OATest into CI pipelines, DL framework vendors can catch optimization‑related regressions before they reach production, reducing costly runtime failures in edge or cloud services.
  • Security Hardening: Optimization passes often manipulate memory layouts; bugs there can be exploited for denial‑of‑service or code‑execution attacks. Systematic, optimization‑aware testing raises the security baseline of DL compilers.
  • Developer Tooling: The edge‑reuse and auxiliary‑layer techniques can be packaged as a library for developers building custom compiler passes, enabling rapid sanity‑checking of new transformations.
  • Benchmarking & Competition: Companies can use OATest‑generated graphs as stress‑test suites to benchmark the effectiveness and safety of their proprietary compilers against open‑source baselines.

Limitations & Future Work

  • Pattern Dependency: OATest’s effectiveness hinges on the quality and diversity of mined patterns; compilers with sparse public test suites may yield fewer useful patterns.
  • Scalability to Large Models: The current workflow focuses on relatively small synthetic graphs; extending the approach to full‑scale models (e.g., GPT‑3) may require additional memory‑aware heuristics.
  • Oracle Coverage: Differential testing assumes a correct reference implementation; for novel optimizations without a trusted baseline, false positives could arise.
  • Future Directions: The authors suggest (i) automated discovery of new optimization patterns via static analysis of compiler source, (ii) integration with hardware‑in‑the‑loop testing to capture device‑specific bugs, and (iii) applying machine‑learning‑guided mutation strategies to further improve coverage.

Authors

  • Qingchao Shen
  • Zan Wang
  • Haoyang Ma
  • Yongqiang Tian
  • Lili Huang
  • Zibo Xiao
  • Junjie Chen
  • Shing-Chi Cheung

Paper Information

  • arXiv ID: 2511.18918v1
  • Categories: cs.SE
  • Published: November 24, 2025