[Paper] LLM4Perf: Large Language Models Are Effective Samplers for Multi-Objective Performance Modeling
Source: arXiv - 2512.16070v1
Overview
Modern software systems expose dozens—or even hundreds—of configuration knobs that dramatically affect latency, throughput, energy use, and other quality attributes. Selecting the right settings is a classic multi‑objective optimization problem, but traditional sampling techniques often miss promising regions of the configuration space. The paper “LLM4Perf: Large Language Models Are Effective Samplers for Multi‑Objective Performance Modeling” investigates whether large language models (LLMs) can act as smart samplers, using their understanding of documentation and code to prune and guide the search. The authors build a feedback‑driven framework called LLM4Perf and demonstrate that it consistently outperforms conventional baselines on several real‑world, highly configurable systems.
Key Contributions
- LLM‑driven sampling framework (LLM4Perf) that combines semantic parsing of configuration documentation with iterative feedback to refine sampling strategies.
- Comprehensive empirical evaluation on four open‑source, highly configurable systems covering a total of 112 multi‑objective scenarios.
- Quantitative evidence of superiority: LLM4Perf achieves the best performance in 68.8 % of the scenarios, and its pruning step improves baseline methods in 91.5 % of cases.
- Insightful analysis of how different LLM components (prompt design, temperature, retrieval of relevant docs) and hyper‑parameters affect sampling effectiveness.
- Open‑source implementation and reproducible experiment scripts released for the community.
Methodology
1. Configuration Space Extraction
   - The LLM parses system documentation (README, config files, comments) to build a semantic map of each configuration option, its type, and any documented constraints.
2. Initial Pruning
   - Using the semantic map, the LLM eliminates clearly infeasible or low‑impact settings (e.g., mutually exclusive flags, options with no performance relevance).
3. Feedback Loop
   - A small set of configurations is sampled and evaluated on the target performance metrics (e.g., latency, memory, energy).
   - The measured outcomes are fed back to the LLM, which updates its internal belief about promising regions and generates a new batch of samples.
4. Iterative Refinement
   - Steps 2‑3 repeat for a fixed budget (e.g., 100 evaluations). The process balances exploration (trying diverse settings) and exploitation (focusing on high‑performing zones).
5. Baseline Comparisons
   - The authors compare LLM4Perf against classic samplers such as random sampling, Latin Hypercube Sampling, and evolutionary multi‑objective optimizers (e.g., NSGA‑II).
All experiments are run on the same hardware, and performance is measured using standard multi‑objective quality indicators (hypervolume, generational distance).
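To make the loop concrete, here is a minimal sketch of the feedback-driven sampling procedure under stated assumptions; it is an illustration, not the authors' released implementation. `llm_complete(prompt)` stands in for any chat-style LLM call, `evaluate(config)` for a measurement run of the target system, and the prompt wording is invented for the example.

```python
import json

# Illustrative sketch of a feedback-driven LLM sampling loop (not the paper's
# implementation). `llm_complete(prompt) -> str` is an assumed wrapper around a
# chat-style LLM; `evaluate(config) -> dict` runs the system under `config` and
# returns the measured objectives, e.g. {"latency_ms": ..., "energy_j": ...}.

def llm4perf_loop(option_docs, llm_complete, evaluate, budget=100, batch_size=10):
    history = []  # (config, metrics) records fed back to the LLM each round

    # Steps 1-2: build a semantic view of the options and prune the space.
    prune_prompt = (
        "Here are a system's configuration options and their documentation:\n"
        + json.dumps(option_docs)
        + "\nReturn, as JSON, only the options likely to affect the performance "
          "objectives, with their types, value ranges, and mutual constraints."
    )
    pruned_space = json.loads(llm_complete(prune_prompt))

    # Steps 3-4: sample a batch, measure it, and feed the results back until
    # the evaluation budget is spent.
    while len(history) < budget:
        sample_prompt = (
            "Pruned configuration space:\n" + json.dumps(pruned_space)
            + "\nMeasurements so far (most recent last):\n"
            + json.dumps(history[-20:])
            + f"\nPropose {batch_size} diverse configurations expected to improve "
              "the Pareto front. Answer as a JSON list of option->value maps."
        )
        batch = json.loads(llm_complete(sample_prompt))
        for config in batch[: budget - len(history)]:
            history.append({"config": config, "metrics": evaluate(config)})

    return history  # post-process into a Pareto front or surrogate model
```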
Results & Findings
| System | Objectives | LLM4Perf Wins | Baseline Wins | Relative Hypervolume Gain |
|---|---|---|---|---|
| Hadoop | Throughput, Energy | 22 / 32 | 10 / 32 | +18 % |
| Spark | Latency, Memory | 19 / 28 | 5 / 28 | +21 % |
| TensorFlow | Training Time, Accuracy | 18 / 26 | 4 / 26 | +15 % |
| PostgreSQL | Query Latency, CPU | 18 / 26 | 8 / 26 | +12 % |
- Overall win rate: 77 out of 112 scenarios (≈68.8 %).
- Pruning impact: When the LLM’s pruning step is applied to the baseline samplers, their performance improves in 410 out of 448 cases (≈91.5 %).
- Component analysis: Prompt engineering (including explicit constraint language) and a moderate temperature (0.7) yield the most reliable sampling; overly deterministic (temperature = 0) or overly random (temperature = 1.0) settings degrade performance.
- Sample efficiency: LLM4Perf reaches comparable hypervolume to NSGA‑II with ≈30 % fewer evaluations, highlighting its sample‑efficiency advantage.
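For readers unfamiliar with the indicator behind the table and the sample-efficiency claim, the two-objective hypervolume can be computed directly. A minimal sketch, assuming both objectives are minimized against a fixed reference point that is worse than every measured configuration (the numbers below are made up for illustration):

```python
def hypervolume_2d(points, reference):
    """Area dominated by a 2-D front under minimization, measured against a
    reference point that is worse than every point in both objectives."""
    # Keep only points that actually dominate the reference point.
    pts = sorted(p for p in points
                 if p[0] <= reference[0] and p[1] <= reference[1])
    volume, prev_y = 0.0, reference[1]
    for x, y in pts:           # walk the front left to right, summing rectangles
        if y < prev_y:         # points with y >= prev_y are dominated; skip them
            volume += (reference[0] - x) * (prev_y - y)
            prev_y = y
    return volume

# Toy example over (latency, energy): the larger hypervolume is the better front.
front_a = [(2.0, 9.0), (4.0, 5.0), (7.0, 3.0)]
front_b = [(3.0, 9.0), (6.0, 6.0)]
reference = (10.0, 10.0)
print(hypervolume_2d(front_a, reference))  # 38.0
print(hypervolume_2d(front_b, reference))  # 19.0
```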
Practical Implications
- Faster configuration tuning: DevOps engineers can integrate LLM4Perf into CI pipelines to automatically suggest high‑performing configuration sets before deployment, cutting down manual trial‑and‑error.
- Reduced cloud cost: By quickly converging on energy‑efficient settings, cloud operators can lower compute‑hour expenses for large‑scale data processing frameworks (e.g., Hadoop, Spark).
- Documentation‑driven optimization: Teams that maintain rich configuration docs get immediate ROI—LLMs turn that textual knowledge into actionable sampling guidance.
- Plug‑and‑play with existing optimizers: The pruning module can be added to any optimizer (e.g., Bayesian optimization, genetic algorithms) to boost its effectiveness without rewriting the core algorithm; see the sketch after this list.
- Low‑overhead adoption: Since the LLM inference cost is modest (a few hundred milliseconds per prompt on a standard GPU), the overall runtime remains dominated by the actual system evaluations, making the approach practical for on‑premise environments.
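As a concrete illustration of the plug-and-play pruning point above, the following sketch bolts a hypothetical LLM pruning step onto plain random sampling. `llm_complete` is again an assumed wrapper around a chat-style LLM, and option domains are treated as discrete value lists for simplicity; none of this is the paper's released code.

```python
import json
import random

def llm_prune(option_docs, llm_complete):
    """Ask the LLM which documented options plausibly affect performance."""
    prompt = ("From the configuration options below, return a JSON list of the "
              "option names that plausibly affect latency, memory, or energy:\n"
              + json.dumps(option_docs))
    return set(json.loads(llm_complete(prompt)))

def pruned_random_sampling(space, defaults, keep, n_samples):
    """Random sampling restricted to the pruned option subset; every other
    option is pinned to its default, shrinking the effective search space."""
    samples = []
    for _ in range(n_samples):
        config = dict(defaults)
        for name in keep & space.keys():               # only vary pruned-in options
            config[name] = random.choice(space[name])  # discrete domains assumed
        samples.append(config)
    return samples
```

The same `keep` set can equally restrict the decision variables handed to Latin Hypercube Sampling, NSGA‑II, or a Bayesian optimizer, which mirrors the paper's finding that the pruning step alone improved the baseline samplers in roughly 91.5 % of cases.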
Limitations & Future Work
- LLM knowledge freshness: The approach relies on the LLM’s ability to understand current documentation; outdated or poorly written docs can mislead the sampler.
- Scalability to ultra‑high dimensional spaces: While pruning helps, the method has been tested on configuration spaces up to ~150 options; extremely large spaces may still require hybrid strategies.
- Model size vs. cost trade‑off: Larger LLMs (e.g., GPT‑4) may improve semantic parsing but increase inference cost; exploring lightweight fine‑tuned models is an open direction.
- Generalization across domains: The study focuses on systems software; applying LLM4Perf to other domains (e.g., embedded firmware, network stack tuning) warrants further investigation.
The authors suggest extending the feedback loop to incorporate online performance telemetry and exploring multi‑LLM ensembles to mitigate single‑model biases.
Authors
- Xin Wang
- Zhenhao Li
- Zishuo Ding
Paper Information
- arXiv ID: 2512.16070v1
- Categories: cs.SE
- Published: December 18, 2025