[Paper] LLM4Perf: Large Language Models Are Effective Samplers for Multi-Objective Performance Modeling
Source: arXiv - 2512.16070v1
Overview
Modern software systems expose dozens—or even hundreds—of configuration knobs that dramatically affect latency, throughput, energy use, and other quality attributes. Selecting the right settings is a classic multi‑objective optimization problem, but traditional sampling techniques often miss promising regions of the configuration space. The paper “LLM4Perf: Large Language Models Are Effective Samplers for Multi‑Objective Performance Modeling” investigates whether large language models (LLMs) can act as smart samplers, using their understanding of documentation and code to prune and guide the search. The authors build a feedback‑driven framework called LLM4Perf and demonstrate that it consistently outperforms conventional baselines on several real‑world, highly configurable systems.
Key Contributions
- LLM‑driven sampling framework (LLM4Perf) that combines semantic parsing of configuration documentation with iterative feedback to refine sampling strategies.
- Comprehensive empirical evaluation on four open‑source, highly configurable systems covering a total of 112 multi‑objective scenarios.
- Quantitative evidence of superiority: LLM4Perf achieves the best performance in 68.8 % of the scenarios, and its pruning step improves baseline methods in 91.5 % of cases.
- Insightful analysis of how different LLM components (prompt design, temperature, retrieval of relevant docs) and hyper‑parameters affect sampling effectiveness.
- Open‑source implementation and reproducible experiment scripts released for the community.
Methodology
1. Configuration Space Extraction
   - The LLM parses system documentation (README, config files, comments) to build a semantic map of each configuration option, its type, and any documented constraints.
2. Initial Pruning
   - Using the semantic map, the LLM eliminates clearly infeasible or low‑impact settings (e.g., mutually exclusive flags, options with no performance relevance).
3. Feedback Loop
   - A small set of configurations is sampled and evaluated on the target performance metrics (e.g., latency, memory, energy).
   - The measured outcomes are fed back to the LLM, which updates its internal belief about promising regions and generates a new batch of samples.
4. Iterative Refinement
   - Steps 2‑3 repeat for a fixed budget (e.g., 100 evaluations). The process balances exploration (trying diverse settings) and exploitation (focusing on high‑performing zones).
5. Baseline Comparisons
   - The authors compare LLM4Perf against classic samplers such as random sampling, Latin Hypercube Sampling, and evolutionary multi‑objective optimizers (e.g., NSGA‑II).
All experiments are run on the same hardware, and performance is measured using standard multi‑objective quality indicators (hypervolume, generational distance).
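To make the loop concrete, here is a minimal sketch of the feedback-driven sampling procedure under stated assumptions; it is an illustration, not the authors' released implementation. `llm_complete(prompt)` stands in for any chat-style LLM call, `evaluate(config)` for a measurement run of the target system, and the prompt wording is invented for the example.

```python
import json

# Illustrative sketch of a feedback-driven LLM sampling loop (not the paper's
# implementation). `llm_complete(prompt) -> str` is an assumed wrapper around a
# chat-style LLM; `evaluate(config) -> dict` runs the system under `config` and
# returns the measured objectives, e.g. {"latency_ms": ..., "energy_j": ...}.

def llm4perf_loop(option_docs, llm_complete, evaluate, budget=100, batch_size=10):
    history = []  # (config, metrics) records fed back to the LLM each round

    # Steps 1-2: build a semantic view of the options and prune the space.
    prune_prompt = (
        "Here are a system's configuration options and their documentation:\n"
        + json.dumps(option_docs)
        + "\nReturn, as JSON, only the options likely to affect the performance "
          "objectives, with their types, value ranges, and mutual constraints."
    )
    pruned_space = json.loads(llm_complete(prune_prompt))

    # Steps 3-4: sample a batch, measure it, and feed the results back until
    # the evaluation budget is spent.
    while len(history) < budget:
        sample_prompt = (
            "Pruned configuration space:\n" + json.dumps(pruned_space)
            + "\nMeasurements so far (most recent last):\n"
            + json.dumps(history[-20:])
            + f"\nPropose {batch_size} diverse configurations expected to improve "
              "the Pareto front. Answer as a JSON list of option->value maps."
        )
        batch = json.loads(llm_complete(sample_prompt))
        for config in batch[: budget - len(history)]:
            history.append({"config": config, "metrics": evaluate(config)})

    return history  # post-process into a Pareto front or surrogate model
```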
Results & Findings
| System | Objectives | LLM4Perf Wins | Baseline Wins | Relative Hypervolume Gain |
|---|---|---|---|---|
| Hadoop | Throughput, Energy | 22 / 32 | 10 / 32 | +18 % |
| Spark | Latency, Memory | 19 / 28 | 5 / 28 | +21 % |
| TensorFlow | Training Time, Accuracy | 18 / 26 | 4 / 26 | +15 % |
| PostgreSQL | Query Latency, CPU | 18 / 26 | 8 / 26 | +12 % |
- Overall win rate: 77 out of 112 scenarios (≈68.8 %).
- Pruning impact: When the LLM’s pruning step is applied to the baseline samplers, their performance improves in 410 out of 448 cases (≈91.5 %).
- Component analysis: Prompt engineering (including explicit constraint language) and a moderate temperature (0.7) yield the most reliable sampling; overly deterministic (temperature = 0) or overly random (temperature = 1.0) settings degrade performance.
- Sample efficiency: LLM4Perf reaches comparable hypervolume to NSGA‑II with ≈30 % fewer evaluations, highlighting its sample‑efficiency advantage.
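For readers unfamiliar with the indicator behind the table and the sample-efficiency claim, the two-objective hypervolume can be computed directly. A minimal sketch, assuming both objectives are minimized against a fixed reference point that is worse than every measured configuration (the numbers below are made up for illustration):

```python
def hypervolume_2d(points, reference):
    """Area dominated by a 2-D front under minimization, measured against a
    reference point that is worse than every point in both objectives."""
    # Keep only points that actually dominate the reference point.
    pts = sorted(p for p in points
                 if p[0] <= reference[0] and p[1] <= reference[1])
    volume, prev_y = 0.0, reference[1]
    for x, y in pts:           # walk the front left to right, summing rectangles
        if y < prev_y:         # points with y >= prev_y are dominated; skip them
            volume += (reference[0] - x) * (prev_y - y)
            prev_y = y
    return volume

# Toy example over (latency, energy): the larger hypervolume is the better front.
front_a = [(2.0, 9.0), (4.0, 5.0), (7.0, 3.0)]
front_b = [(3.0, 9.0), (6.0, 6.0)]
reference = (10.0, 10.0)
print(hypervolume_2d(front_a, reference))  # 38.0
print(hypervolume_2d(front_b, reference))  # 19.0
```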
Practical Implications
- Faster configuration tuning: DevOps engineers can integrate LLM4Perf into CI pipelines to automatically suggest high‑performing configuration sets before deployment, cutting down manual trial‑and‑error.
- Reduced cloud cost: By quickly converging on energy‑efficient settings, cloud operators can lower compute‑hour expenses for large‑scale data processing frameworks (e.g., Hadoop, Spark).
- Documentation‑driven optimization: Teams that maintain rich configuration docs get immediate ROI—LLMs turn that textual knowledge into actionable sampling guidance.
- Plug‑and‑play with existing optimizers: The pruning module can be added to any optimizer (e.g., Bayesian optimization, genetic algorithms) to boost its effectiveness without rewriting the core algorithm; see the sketch after this list.
- Low‑overhead adoption: Since the LLM inference cost is modest (a few hundred milliseconds per prompt on a standard GPU), the overall runtime remains dominated by the actual system evaluations, making the approach practical for on‑premise environments.
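As a concrete illustration of the plug-and-play pruning point above, the following sketch bolts a hypothetical LLM pruning step onto plain random sampling. `llm_complete` is again an assumed wrapper around a chat-style LLM, and option domains are treated as discrete value lists for simplicity; none of this is the paper's released code.

```python
import json
import random

def llm_prune(option_docs, llm_complete):
    """Ask the LLM which documented options plausibly affect performance."""
    prompt = ("From the configuration options below, return a JSON list of the "
              "option names that plausibly affect latency, memory, or energy:\n"
              + json.dumps(option_docs))
    return set(json.loads(llm_complete(prompt)))

def pruned_random_sampling(space, defaults, keep, n_samples):
    """Random sampling restricted to the pruned option subset; every other
    option is pinned to its default, shrinking the effective search space."""
    samples = []
    for _ in range(n_samples):
        config = dict(defaults)
        for name in keep & space.keys():               # only vary pruned-in options
            config[name] = random.choice(space[name])  # discrete domains assumed
        samples.append(config)
    return samples
```

The same `keep` set can equally restrict the decision variables handed to Latin Hypercube Sampling, NSGA‑II, or a Bayesian optimizer, which mirrors the paper's finding that the pruning step alone improved the baseline samplers in roughly 91.5 % of cases.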
Limitations & Future Work
- LLM knowledge freshness: The approach relies on the LLM’s ability to understand current documentation; outdated or poorly written docs can mislead the sampler.
- Scalability to ultra‑high dimensional spaces: While pruning helps, the method has been tested on configuration spaces up to ~150 options; extremely large spaces may still require hybrid strategies.
- Model size vs. cost trade‑off: Larger LLMs (e.g., GPT‑4) may improve semantic parsing but increase inference cost; exploring lightweight fine‑tuned models is an open direction.
- Generalization across domains: The study focuses on systems software; applying LLM4Perf to other domains (e.g., embedded firmware, network stack tuning) warrants further investigation.
The authors suggest extending the feedback loop to incorporate online performance telemetry and exploring multi‑LLM ensembles to mitigate single‑model biases.
Authors
- Xin Wang
- Zhenhao Li
- Zishuo Ding
Paper Information
- arXiv ID: 2512.16070v1
- Categories: cs.SE
- Published: December 18, 2025