[Paper] LLM-Driven Kernel Evolution: Automating Driver Updates in Linux
Source: arXiv - 2511.18924v1
Overview
The Linux kernel evolves rapidly, and every change can break the thousands of device drivers that depend on it. The paper introduces DRIVEBENCH, a curated corpus of real kernel‑driver co‑evolution cases, and AUTODRIVER, a closed‑loop system that leverages large language models (LLMs) to automatically generate and validate driver patches. By combining prompt engineering, multi‑agent collaboration, static analysis, and runtime testing, the authors demonstrate a practical path toward keeping drivers in sync with the kernel without manual rewrites.
Key Contributions
- DRIVEBENCH corpus: 235 fully validated kernel‑driver evolution cases (spanning Linux 5.10–6.10) extracted from an initial pool of 612 candidates, released publicly for reproducible research (a hypothetical case‑record sketch follows this list).
- AUTODRIVER framework: An end‑to‑end pipeline that (1) formulates precise LLM prompts, (2) orchestrates multiple LLM agents to propose patches, (3) runs static analysis to catch API/ABI mismatches, and (4) iteratively validates patches via compilation and QEMU‑based boot tests.
- Empirical evaluation: On 55 unseen evolution cases, AUTODRIVER achieves a 56.4% (31/55) compilation success rate, and 27 of the 31 compiled patches preserve driver initialization when exercised in a virtualized environment.
- Open‑source tooling: All scripts, prompts, and evaluation harnesses are released under permissive licenses, enabling the community to extend or integrate the approach into CI pipelines.
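This summary does not reproduce the corpus's on‑disk format. As a rough orientation, the following is a hypothetical sketch of what one DRIVEBENCH case record could contain, assuming each case pairs the kernel‑side diff with the affected driver's before/after sources plus validation metadata; all field names are illustrative, not the authors' schema.

```python
from dataclasses import dataclass

@dataclass
class EvolutionCase:
    """Hypothetical record for one kernel-driver co-evolution case.

    Field names are illustrative; the actual DRIVEBENCH format is
    defined by the authors' released tooling.
    """
    case_id: str             # stable identifier, e.g. the originating commit hash
    kernel_commit: str       # commit that changed the kernel-core API
    kernel_version: str      # target kernel, within the 5.10-6.10 range
    kernel_diff: str         # unified diff of the core-side change
    driver_path: str         # affected driver source file(s)
    driver_before: str       # driver source prior to the change
    driver_after: str        # ground-truth driver source after the change
    change_summary: str      # concise description of the semantic change
    validated: bool = False  # passed compilation and QEMU boot checks
```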
Methodology
- Case collection: The authors mined the Linux git history for commits that simultaneously modify kernel core files and driver source files. Manual vetting filtered out noisy or irrelevant changes, yielding the DRIVEBENCH dataset (a minimal mining sketch appears after this list).
- Prompt engineering: For each case, a structured prompt is built that includes (a) the original driver code, (b) the kernel diff, and (c) a concise description of the required semantic change (e.g., "replace `pci_register_driver` with `module_driver`"); a prompt‑assembly sketch appears after this list.
- Multi‑agent LLM workflow (a refinement‑loop sketch appears after this list):
  - Generator agent produces an initial patch.
  - Reviewer agent runs static analysis tools (e.g., `sparse`, `clang-tidy`) on the patch and feeds back diagnostics.
  - Refiner agent iteratively amends the patch until static checks pass.
- Closed‑loop validation: The refined patch is compiled against the target kernel version. Successful builds are then executed inside QEMU to verify that the driver loads, initializes, and does not crash during early boot; failures trigger another refinement cycle or are logged for manual inspection (a compile‑and‑boot sketch appears after this list).
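The mining sketch referenced under case collection: a minimal version assuming a local Linux git checkout. The drivers/-prefix heuristic and the revision range are illustrative stand‑ins for the authors' actual filters, and the manual vetting step is not shown.

```python
import subprocess

def changed_files(repo: str, commit: str) -> list[str]:
    """Return the paths touched by a commit (names only, no diff body)."""
    out = subprocess.run(
        ["git", "-C", repo, "show", "--name-only", "--pretty=format:", commit],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def candidate_commits(repo: str, rev_range: str = "v5.10..v6.10") -> list[str]:
    """Find commits that modify both kernel-core files and driver sources.

    The drivers/ split is a crude heuristic; the paper's pipeline relies
    on manual vetting to drop noisy or irrelevant candidates.
    """
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", rev_range],
        capture_output=True, text=True, check=True,
    )
    hits = []
    for commit in out.stdout.split():
        files = changed_files(repo, commit)
        if any(f.startswith("drivers/") for f in files) and \
           any(not f.startswith("drivers/") for f in files):
            hits.append(commit)
    return hits
```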
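The prompt‑assembly sketch referenced above, combining the three ingredients the paper names. The template text and section headers are hypothetical; the authors' released prompts define the real wording.

```python
def build_prompt(driver_src: str, kernel_diff: str, change_summary: str) -> str:
    """Assemble a structured prompt from driver code, kernel diff, and a
    change description. The wording below is illustrative only."""
    return (
        "You are updating a Linux device driver to follow a kernel API change.\n\n"
        "## Required semantic change\n"
        f"{change_summary}\n\n"
        "## Kernel-side diff\n"
        f"{kernel_diff}\n\n"
        "## Current driver source\n"
        f"{driver_src}\n\n"
        "Produce a unified diff that modifies only the driver."
    )
```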
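The refinement‑loop sketch for the generator/reviewer/refiner workflow. Agents are modeled as plain functions, the static checker is a stand‑in for tools such as sparse or clang-tidy, and the round budget is an assumed parameter.

```python
from typing import Callable

# Stand-in agent types; in AUTODRIVER these are LLM-backed agents.
Generator = Callable[[str], str]        # prompt -> candidate patch
StaticChecker = Callable[[str], str]    # patch -> diagnostics ("" if clean)
Refiner = Callable[[str, str], str]     # (patch, diagnostics) -> amended patch

def refine_until_clean(prompt: str,
                       generate: Generator,
                       check: StaticChecker,
                       refine: Refiner,
                       max_rounds: int = 5) -> str | None:
    """Iterate generator -> reviewer -> refiner until static checks pass.

    Returns the patch once diagnostics are empty, or None when the round
    budget is exhausted (such cases go to manual inspection).
    """
    patch = generate(prompt)
    for _ in range(max_rounds):
        diagnostics = check(patch)   # e.g. sparse / clang-tidy output
        if not diagnostics:
            return patch             # static checks pass; proceed to build
        patch = refine(patch, diagnostics)
    return None
```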
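Finally, the compile‑and‑boot sketch for closed‑loop validation, assuming an x86 target and a serial‑console success marker. The make and QEMU invocations are generic, and the real harness's pass/fail criteria are richer than a single string match.

```python
import subprocess

def compiles(kernel_tree: str) -> bool:
    """Build the patched tree; a nonzero exit means the patch fails to compile."""
    return subprocess.run(["make", "-C", kernel_tree, "-j8"]).returncode == 0

def boots_and_probes(bzimage: str, marker: str, timeout: int = 120) -> bool:
    """Boot the built kernel under QEMU and scan the serial console.

    `marker` is a string the driver is expected to print on a successful
    probe; the paper's harness checks more than a single line.
    """
    try:
        proc = subprocess.run(
            ["qemu-system-x86_64", "-kernel", bzimage, "-nographic",
             "-append", "console=ttyS0 panic=1", "-no-reboot"],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return marker in proc.stdout
```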
Results & Findings
- Compilation success: 31 of 55 test cases compiled cleanly after the LLM‑driven refinement loop (56.4%).
- Runtime sanity: In QEMU boot tests, 27 of the 31 compiled patches retained correct driver initialization (e.g., proper probe callbacks, resource allocation).
- Error patterns: Most failures stemmed from subtle ABI changes (e.g., struct layout modifications) that static analysis missed, highlighting the need for deeper semantic checks.
- Efficiency: The average end‑to‑end turnaround per case was ~12 minutes on a single GPU‑enabled workstation, suggesting feasibility for integration into continuous integration (CI) pipelines.
Practical Implications
- Automated driver maintenance: Kernel maintainers and OEMs can plug AUTODRIVER into their CI to automatically generate patches whenever a new kernel release is cut, dramatically reducing the backlog of driver breakages (a minimal CI‑trigger sketch follows this list).
- Faster security updates: Security‑hardening patches in the kernel can be propagated to drivers without waiting for manual porting, shrinking the window of exposure for vulnerable hardware.
- Vendor‑agnostic tooling: Because the system works on raw source diffs and standard static analysis tools, it can be adopted by any organization that ships Linux drivers (embedded, automotive, IoT, etc.).
- Research platform: DRIVEBENCH provides a benchmark for future work on LLM‑assisted code transformation, enabling comparative studies on prompt strategies, model sizes, or alternative verification methods.
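A minimal sketch of the CI hook suggested above, assuming AUTODRIVER can be driven from a command line; the autodriver command and its flags are hypothetical stand‑ins for however the released tooling is actually invoked.

```python
import subprocess

def new_kernel_tags(repo: str, seen: set[str]) -> list[str]:
    """List release tags (v*) that have not been processed yet."""
    out = subprocess.run(["git", "-C", repo, "tag", "--list", "v*"],
                         capture_output=True, text=True, check=True)
    return [tag for tag in out.stdout.split() if tag not in seen]

def run_autodriver(repo: str, tag: str, drivers: list[str]) -> None:
    """Launch one patch-generation run per maintained driver.

    The `autodriver` CLI and its flags are hypothetical; substitute the
    project's actual entry point.
    """
    for driver in drivers:
        subprocess.run(["autodriver", "--kernel", repo,
                        "--target", tag, "--driver", driver])
```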
Limitations & Future Work
- Partial success rate: Only about half of the cases compile and run correctly; more sophisticated semantic analysis (e.g., type‑state modeling) is needed to capture ABI nuances.
- Model dependence: Results are tied to the specific LLM and prompting strategy used; exploring open‑source models or fine‑tuning on kernel code could improve robustness.
- Scalability to large drivers: The current pipeline handles modest‑size drivers; scaling to massive subsystems (e.g., networking stacks) may require hierarchical prompting or chunked analysis.
- Human‑in‑the‑loop: The authors envision a semi‑automated workflow where developers review LLM‑generated patches before merging, especially for safety‑critical hardware. Future work will study optimal points for human intervention and UI design for such collaboration.
Authors
- Arina Kharlamova
- Jiawen Liu
- Tianyi Zhang
- Xinrui Yang
- Humaid Alqasimi
- Youcheng Sun
- Chun Jason Xue
Paper Information
- arXiv ID: 2511.18924v1
- Categories: cs.SE, cs.AI
- Published: November 24, 2025