[Paper] IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs
Source: arXiv - 2602.22017v1
Overview
The paper introduces IOAgent, an AI‑driven assistant that brings expert‑level I/O performance diagnosis to everyday HPC users. By leveraging large language models (LLMs) together with domain‑specific knowledge bases, IOAgent can automatically analyze Darshan I/O traces, pinpoint bottlenecks, and explain its reasoning—making trustworthy performance debugging accessible to scientists who lack dedicated I/O experts.
Key Contributions
- End‑to‑end diagnosis pipeline that combines a modular pre‑processor, a Retrieval‑Augmented Generation (RAG) knowledge integrator, and a tree‑based answer merger to handle long trace files.
- TraceBench, the first publicly released benchmark suite of labeled HPC I/O traces for systematic evaluation of diagnosis tools.
- LLM‑agnostic design: IOAgent works with both proprietary (e.g., GPT‑4) and open‑source (e.g., LLaMA) models without sacrificing accuracy.
- Explainable output: every diagnosis is accompanied by detailed justifications and citations to relevant documentation, mirroring the workflow of a human I/O expert.
- Interactive query interface that lets users ask follow‑up questions, enabling a conversational debugging experience.
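To make the "explainable output" contribution concrete, a single diagnosis entry might take a shape like the following. This is a hypothetical illustration only; the field names and values are assumptions, not the paper's published schema.

```python
# Hypothetical shape of one IOAgent diagnosis entry; field names and values
# are illustrative only, not taken from the paper.
diagnosis = {
    "symptom": "write bandwidth collapses during the checkpoint phase",
    "root_cause": "many small, unaligned writes issued by every rank",
    "suggested_fix": "aggregate writes and enable collective buffering",
    "citations": [
        "retrieved passage: file-system stripe-alignment guide",
        "retrieved passage: MPI-IO collective buffering notes",
    ],
    "confidence": 0.87,
}
```

The key property described in the paper is that every entry carries citations back to retrieved documentation, so a user can verify the reasoning rather than trust the model blindly.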
Methodology
- Trace Ingestion & Pre‑processing – The raw Darshan trace (often megabytes) is split into logical chunks (e.g., per MPI rank, per I/O phase). A lightweight parser extracts key metrics (bytes transferred, operation counts, timestamps).
- Domain Knowledge Retrieval – A curated corpus of HPC storage documentation, best‑practice guides, and prior diagnosis reports is indexed. When a trace chunk is fed to the LLM, a RAG component fetches the most relevant passages to ground the model’s reasoning.
- LLM Reasoning – The selected LLM receives the chunk plus retrieved knowledge as context. Prompt engineering forces the model to produce a structured diagnosis (symptom, root cause, suggested fix) and to cite the supporting passages.
- Tree‑Based Merger – Individual chunk‑level diagnoses are merged into a coherent, hierarchical report. Conflicts are resolved by a voting scheme that prefers diagnoses with higher confidence scores and stronger citations.
- Interactive Layer – Users can query the final report (e.g., “Why is my collective I/O slow?”) and the system re‑runs the relevant sub‑tree through the LLM, preserving the original justification chain.
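The stages above can be sketched end to end in a few dozen lines. Everything below is a hypothetical reconstruction: the function names, the keyword-overlap retrieval, and the stubbed LLM stand in for the paper's actual Darshan parser, RAG index, and model calls.

```python
# Sketch of the chunk -> retrieve -> diagnose -> merge pipeline described
# in the paper. All names and the stub LLM are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    symptom: str
    root_cause: str
    fix: str
    confidence: float
    citations: list = field(default_factory=list)

def chunk_trace(records, key):
    """Stage 1: split parsed trace records into logical chunks (e.g., per rank)."""
    chunks = {}
    for rec in records:
        chunks.setdefault(rec[key], []).append(rec)
    return chunks

def retrieve(chunk, corpus, k=2):
    """Stage 2 (toy RAG): rank corpus passages by keyword overlap with the chunk."""
    words = {w for rec in chunk for w in str(rec).lower().split()}
    ranked = sorted(corpus, key=lambda p: -len(words & set(p.lower().split())))
    return ranked[:k]

def diagnose_chunk(chunk, passages, llm):
    """Stage 3: the LLM turns chunk + retrieved knowledge into a structured diagnosis."""
    return llm(chunk, passages)

def merge(diagnoses):
    """Stages 4-5: resolve conflicts, preferring higher confidence and more citations."""
    return max(diagnoses, key=lambda d: (d.confidence, len(d.citations)))

# --- toy trace data and a deterministic stub in place of a real LLM call ---
corpus = [
    "Many small writes incur metadata overhead on parallel file systems.",
    "Collective buffering improves shared-file write bandwidth.",
]
records = [
    {"rank": 0, "op": "write", "bytes": 16},
    {"rank": 0, "op": "write", "bytes": 32},
    {"rank": 1, "op": "write", "bytes": 1048576},
]

def stub_llm(chunk, passages):
    small = all(rec["bytes"] < 4096 for rec in chunk)
    if small:
        return Diagnosis("many tiny writes", "unbuffered per-element output",
                         "aggregate writes into larger buffers", 0.9, passages)
    return Diagnosis("no anomaly", "n/a", "n/a", 0.3)

chunks = chunk_trace(records, "rank")
report = merge(diagnose_chunk(c, retrieve(c, corpus), stub_llm)
               for c in chunks.values())
print(report.symptom, report.confidence)  # -> many tiny writes 0.9
```

In the real system the merge is hierarchical (a tree over many chunks) rather than a single `max`, but the selection criterion, confidence plus citation strength, is the same idea.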
Results & Findings
- Accuracy: On TraceBench (≈1,200 labeled traces), IOAgent achieved a 92 % correct‑diagnosis rate, surpassing the previous state‑of‑the‑art tool IOTrace (84 %).
- Explainability: 96 % of IOAgent’s reports included at least one verifiable citation, compared to 68 % for baseline LLM‑only approaches that suffered from hallucinations.
- LLM Independence: Experiments with GPT‑4, Claude, and the open‑source LLaMA‑2‑13B showed less than 3 % variance in diagnosis quality, confirming the pipeline’s model‑agnostic nature.
- Performance: End‑to‑end latency averaged 12 seconds per trace (≈200 MB), well within interactive use‑case limits.
- User Study: A small group of domain scientists reported a 45 % reduction in time spent on I/O debugging after adopting IOAgent.
Practical Implications
- Democratizing Expertise – Smaller research groups can now obtain reliable I/O diagnostics without hiring a dedicated storage engineer, accelerating time‑to‑science for data‑intensive workloads.
- Integration into Job Schedulers – IOAgent can be hooked into Slurm or PBS to automatically analyze completed jobs and surface performance tips in the job’s post‑mortem logs.
- Continuous Monitoring – By feeding live Darshan traces, administrators can proactively detect emerging storage pathologies (e.g., contention, mis‑aligned I/O) before they impact production runs.
- Vendor‑Neutral Tuning – Since the system relies on generic storage knowledge rather than vendor‑specific heuristics, it can be deployed across heterogeneous HPC clusters (Lustre, GPFS, BeeGFS).
- Open‑Source Ecosystem – The released TraceBench and the modular pipeline invite community extensions—e.g., adding support for other tracing formats (e.g., Score‑P) or custom domain corpora.
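As a concrete illustration of the scheduler integration mentioned above, a Slurm epilog could hand each completed job's Darshan log to the diagnosis tool. The log directory layout, filename pattern, and the `ioagent` command-line invocation are all assumptions for illustration; the paper does not define a CLI.

```python
#!/usr/bin/env python3
# Hypothetical Slurm epilog hook: locate a finished job's Darshan log and
# attach a diagnosis summary to the job's post-mortem output. The paths and
# the `ioagent` CLI name are assumptions, not from the paper.
import glob
import os
import subprocess
import sys

DARSHAN_LOG_DIR = "/var/log/darshan"  # site-specific; adjust per cluster

def analyze_job(job_id: str) -> str:
    """Find the completed job's Darshan log and return a diagnosis summary."""
    pattern = os.path.join(DARSHAN_LOG_DIR, f"*_id{job_id}_*.darshan")
    logs = glob.glob(pattern)
    if not logs:
        return f"job {job_id}: no Darshan log found"
    # Hand the trace to the (hypothetical) diagnosis CLI.
    result = subprocess.run(["ioagent", "diagnose", logs[0]],
                            capture_output=True, text=True)
    return result.stdout or f"job {job_id}: diagnosis produced no output"

if __name__ == "__main__":
    job = os.environ.get("SLURM_JOB_ID", sys.argv[-1])
    print(analyze_job(job))
```

Slurm sets `SLURM_JOB_ID` in the epilog environment, so the same script works both as a registered epilog and as a manual post-mortem command.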
Limitations & Future Work
- Context Window Still Bounded – Extremely large traces (>1 GB) require additional chunking heuristics, which may miss cross‑chunk correlations.
- Knowledge Base Staleness – The RAG corpus must be periodically refreshed to stay current with evolving storage technologies and vendor documentation.
- Hallucination Risk in Edge Cases – Although mitigated, rare mis‑diagnoses still occur when the LLM extrapolates beyond the retrieved material.
- Scalability of Interactive Queries – Real‑time follow‑up on massive reports can introduce latency; future work will explore caching and incremental reasoning.
- Broader Benchmarking – Extending TraceBench to include emerging workloads (e.g., AI model checkpointing) and multi‑tenant cloud‑HPC environments is planned.
IOAgent showcases how LLMs, when tightly coupled with domain‑specific retrieval and structured merging, can transform a niche expert skill into a widely usable service—ushering in a new era of AI‑assisted HPC performance engineering.
Authors
- Chris Egersdoerfer
- Arnav Sareen
- Jean Luca Bez
- Suren Byna
- Dongkuan Xu
- Dong Dai
Paper Information
- arXiv ID: 2602.22017v1
- Categories: cs.DC
- Published: February 25, 2026