[Paper] STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems
Source: arXiv - 2602.23220v1
Overview
The paper introduces STELLAR, an autonomous tuning engine that leverages large language models (LLMs) to optimize the configuration of high‑performance parallel file systems. By turning the traditionally manual, trial‑and‑error process of I/O tuning into a fast, data‑driven loop, STELLAR can find near‑optimal settings in just a handful of runs, making storage performance tuning practical for scientists and engineers who lack deep systems expertise.
Key Contributions
- LLM‑driven end‑to‑end tuning pipeline that extracts tunable parameters from documentation, interprets I/O traces, and iteratively refines configurations.
- Retrieval‑augmented generation (RAG) + tool‑use architecture that grounds LLM reasoning in real system data, dramatically reducing hallucinations.
- Multi‑agent design that stabilizes decision‑making by having separate agents specialize in extraction, analysis, and strategy selection.
- Empirical evidence that STELLAR reaches near‑optimal performance within the first five tuning attempts on unseen workloads, compared with traditional autotuners that may need thousands of iterations.
- Knowledge‑base feedback loop that captures successful tuning patterns for reuse on future applications, turning each run into a learning experience for the system.
Methodology
1. Parameter Extraction – An LLM reads the parallel file system’s manual (e.g., Lustre, GPFS) and builds a structured list of all configurable knobs (stripe size, I/O scheduler, cache policies, etc.).
2. Trace Analysis – The application’s I/O trace log is fed to the LLM, which identifies workload characteristics (read/write mix, access patterns, concurrency level).
3. Initial Strategy Selection – Using the extracted parameters and trace insights, the LLM proposes a small set of promising configurations (often just one or two).
4. Execution & Feedback – The system runs the application with the chosen settings on a real cluster, measures throughput/latency, and records the results.
5. Iterative Refinement – The LLM reasons over the performance feedback, adjusts the configuration, and repeats steps 3–4.
6. Knowledge Consolidation – After convergence, the system summarizes the tuning journey into a reusable knowledge entry (e.g., “for write‑heavy, small‑file workloads, stripe size = 64 KB works best”).
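The propose → execute → refine cycle above can be sketched as a short feedback loop. This is an illustrative toy, not the paper's implementation: `propose_config` stands in for the LLM planner (here a trivial hill-climbing heuristic on stripe size), and `run_benchmark` stands in for a real cluster run (here a synthetic throughput model with a peak at 256 KB stripes so the loop has something to converge toward). All function names and numbers are hypothetical.

```python
def propose_config(params, trace_summary, history):
    """Stand-in for the LLM planner: pick the next configuration from
    the feedback history. A real planner would reason over traces and
    manual snippets; this toy just doubles the stripe size while
    throughput keeps improving and backs off otherwise."""
    if not history:
        return {"stripe_size_kb": 64, "scheduler": "deadline"}
    last_cfg, last_tput = history[-1]
    cfg = dict(last_cfg)
    if len(history) < 2 or last_tput >= history[-2][1]:
        cfg["stripe_size_kb"] *= 2   # keep moving in the improving direction
    else:
        cfg["stripe_size_kb"] = max(64, cfg["stripe_size_kb"] // 2)
    return cfg

def run_benchmark(cfg):
    """Stand-in for executing the workload on a real cluster: a synthetic
    throughput model peaking at a 256 KB stripe size."""
    return 1000 - abs(cfg["stripe_size_kb"] - 256)

def tune(max_iters=5):
    """Run the iterative refinement loop for a few evaluations and
    return the best configuration seen."""
    params = ["stripe_size_kb", "scheduler"]      # from parameter extraction
    trace_summary = "write-heavy, small files"    # from trace analysis
    history = []
    for _ in range(max_iters):
        cfg = propose_config(params, trace_summary, history)
        tput = run_benchmark(cfg)
        history.append((cfg, tput))
    return max(history, key=lambda h: h[1])
```

With this synthetic model the loop reaches the 256 KB peak within five evaluations, mirroring the paper's small-iteration-budget setting.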
The pipeline is orchestrated by a multi‑agent framework:
- Extractor Agent handles documentation parsing.
- Analyzer Agent interprets traces.
- Planner Agent proposes configurations.
- Executor Agent runs the workload and reports metrics.
RAG is used throughout to pull relevant snippets from manuals or prior tuning logs, keeping the LLM’s reasoning anchored to factual data.
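The grounding step can be illustrated with a minimal retrieval sketch. This is an assumption-laden stand-in, not the paper's RAG pipeline: it ranks manual snippets by token overlap with the query (a real system would use embedding similarity) and prepends the top hits to the planner's prompt. The snippet texts and helper names are hypothetical.

```python
# Hypothetical snippets of a Lustre manual used as the retrieval corpus.
MANUAL_SNIPPETS = [
    "lfs setstripe -S sets the stripe size for new files in a directory.",
    "Stripe count controls how many OSTs a file is spread across.",
    "Client-side caching can be tuned via llite max_cached_mb.",
]

def retrieve(query, snippets, k=2):
    """Rank snippets by token overlap with the query; a crude stand-in
    for the embedding-based retriever a real RAG pipeline would use."""
    q = set(query.lower().split())
    scored = sorted(snippets,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query):
    """Anchor the LLM's reasoning by placing retrieved manual text
    ahead of the question."""
    context = "\n".join(retrieve(query, MANUAL_SNIPPETS))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Keeping retrieved snippets in the prompt is what ties each agent's suggestions back to documented parameters rather than free-form generation.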
Results & Findings
- Speed of convergence: In 90% of 30 benchmark applications, STELLAR identified a configuration within 5 iterations that was within ±3% of the globally optimal throughput (as determined by exhaustive search).
- Search‑space reduction: The LLM‑guided approach cut the effective search space by >99% compared with naïve grid or random search.
- Robustness to unseen workloads: Even for applications not represented in the training data, the system’s reasoning based on trace patterns generalized well.
- Ablation study: Removing RAG or the multi‑agent coordination increased the average number of iterations needed from 5 to 27, confirming the importance of grounding and specialization.
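The search-space reduction claim is easy to sanity-check with back-of-the-envelope arithmetic. The knob counts below are hypothetical (the paper does not enumerate them here); the point is that even a modest grid explodes combinatorially, so a 5-evaluation budget trivially clears the >99% reduction bar.

```python
from math import prod

# Hypothetical value counts per tunable knob, for illustration only.
knob_values = {
    "stripe_size": 8,    # e.g., 64 KB .. 8 MB in powers of two
    "stripe_count": 8,
    "io_scheduler": 4,
    "cache_policy": 3,
}

full_grid = prod(knob_values.values())   # every combination a grid search visits
llm_guided = 5                           # evaluations in STELLAR's budget
reduction = 1 - llm_guided / full_grid
print(f"grid: {full_grid} configs, reduction: {reduction:.1%}")
```

With these assumed counts the grid holds 768 configurations, so evaluating only 5 already reduces the search by roughly 99.3%; real deployments with more knobs would reduce it further.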
Practical Implications
- For system administrators: STELLAR can be deployed as a “plug‑and‑play” service that continuously optimizes storage settings as new jobs arrive, reducing the need for manual tuning expertise.
- For developers of data‑intensive pipelines: Teams can focus on algorithmic work rather than low‑level I/O knobs; the tuner automatically adapts to changing data sizes or access patterns.
- For cloud and HPC providers: Embedding STELLAR into job‑submission portals could improve overall cluster utilization and lower the cost per compute hour by squeezing extra I/O performance without hardware upgrades.
- For other optimization domains: The paper’s architecture (LLM + RAG + multi‑agent loop) is reusable for tuning compilers, network stacks, or even hyper‑parameter selection in machine‑learning pipelines where each evaluation is expensive.
Limitations & Future Work
- Dependence on high‑quality documentation: If the manual is sparse or outdated, parameter extraction may miss critical knobs.
- Scalability of real‑system runs: While the iteration count is low, each iteration still requires a full application execution, which can be costly for very long jobs.
- Hallucination risk: Although mitigated by RAG and multi‑agent checks, occasional incorrect LLM suggestions were observed, especially for obscure parameters.
- Future directions include:
  - Integrating simulation‑based proxies to evaluate configurations faster.
  - Extending the knowledge base to cross‑cluster environments.
  - Exploring fine‑tuned domain‑specific LLMs to further reduce hallucinations.
Authors
- Chris Egersdoerfer
- Philip Carns
- Shane Snyder
- Robert Ross
- Dong Dai
Paper Information
- arXiv ID: 2602.23220v1
- Categories: cs.DC
- Published: February 26, 2026