[Paper] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Source: arXiv - 2511.21285v1
Overview
Large language models (LLMs) deliver impressive results, but their sheer size makes full fine‑tuning costly in compute, memory, and carbon footprint. The paper PEFT‑Bench introduces a reproducible, end‑to‑end benchmark that lets researchers and engineers compare parameter‑efficient fine‑tuning (PEFT) techniques across many tasks and models, while also accounting for speed, memory, and the number of trainable parameters.
Key Contributions
- PEFT‑Bench suite: a unified framework that automates data loading, model preparation, training, and evaluation for six popular PEFT methods on six autoregressive LLMs.
- Broad coverage: experiments run on 27 downstream NLP datasets spanning classification, generation, and reasoning tasks.
- New composite metric – PEFT Soft Score Penalties (PSCP): combines downstream accuracy with penalties for trainable‑parameter count, inference latency, and peak training memory, giving a single “efficiency‑aware” score.
- Open‑source release: code, configs, and Docker images are publicly available, lowering the barrier for reproducibility and future extensions.
- Empirical insights: systematic comparison reveals trade‑offs between different PEFT families (adapter‑based, prompt‑tuning, LoRA, etc.) that were previously scattered across papers.
Methodology
- Model & PEFT selection – The authors pick six widely used autoregressive LLMs (e.g., GPT‑2‑XL, LLaMA‑7B) and six PEFT strategies:
  - Adapter modules
  - Prefix‑tuning
  - Prompt‑tuning
  - LoRA (Low‑Rank Adaptation)
  - BitFit (bias‑only fine‑tuning)
  - IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
- Dataset pipeline – A unified data loader normalizes 27 benchmark datasets (GLUE, SuperGLUE, XSum, etc.) into a common format, handling tokenization, train/validation splits, and task‑specific metrics.
- Training loop – PEFT‑Bench wraps the Hugging Face Trainer, automatically freezing the base‑model weights and exposing only the PEFT parameters. Hyper‑parameters (learning rate, epochs, batch size) are kept constant across methods to ensure a fair comparison (see the end‑to‑end sketch after this list).
- Evaluation & PSCP – After fine‑tuning, each run is measured for:
  - Task performance (accuracy, F1, ROUGE, etc.)
  - Trainable‑parameter count
  - Inference latency (average time per token on a single GPU)
  - Peak training memory (GPU memory footprint)
  The PSCP score is computed as:
  $$\text{PSCP} = \text{TaskScore} \times \exp\bigl(-\alpha\frac{P}{P_{\max}} - \beta\frac{L}{L_{\max}} - \gamma\frac{M}{M_{\max}}\bigr)$$
  where $P$, $L$, and $M$ are the trainable‑parameter count, inference latency, and peak training memory, each normalized by its maximum observed value, and $\alpha, \beta, \gamma$ are tunable penalty weights (default = 1). A small worked example follows this list.
- Reproducibility – All experiments are containerized; random seeds, hardware specs, and logs are recorded automatically.
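To make the training‑loop step concrete, here is a minimal end‑to‑end sketch in the spirit of PEFT‑Bench, built on the Hugging Face `transformers`, `peft`, and `datasets` libraries. The model (`gpt2` as a small stand‑in for GPT‑2‑XL), the SST‑2 subset, and all hyper‑parameters are illustrative assumptions, not the benchmark's actual configuration:

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# "gpt2" is a small stand-in for the larger autoregressive models in the paper.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach LoRA: base weights are frozen, only the low-rank adapters train.
model = get_peft_model(model, LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05))
model.print_trainable_parameters()  # reports the tiny trainable fraction

# Normalize one of the 27 datasets into a common format (SST-2 shown here).
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

train_ds = load_dataset("glue", "sst2", split="train[:1%]").map(
    tokenize, batched=True, remove_columns=["sentence", "label", "idx"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="peft-bench-demo",
        learning_rate=3e-4,              # held constant across methods for fairness
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Swapping LoRA for another of the six methods is a one‑line change of the config object (e.g., `PrefixTuningConfig` or `PromptTuningConfig` from the same library).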
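And a small worked example of the PSCP formula itself; the numbers below are made up for illustration and are not results from the paper:

```python
import math

def pscp(task_score, params, latency, memory,
         params_max, latency_max, memory_max,
         alpha=1.0, beta=1.0, gamma=1.0):
    """Task performance discounted by exponential penalties on the
    normalized trainable-parameter count, latency, and memory."""
    penalty = (alpha * params / params_max
               + beta * latency / latency_max
               + gamma * memory / memory_max)
    return task_score * math.exp(-penalty)

# Toy comparison: a higher-scoring but heavier run vs. a lighter one,
# both normalized by the largest values observed across the benchmark.
heavy = pscp(84.2, params=40e6, latency=35.0, memory=13.9,
             params_max=40e6, latency_max=35.0, memory_max=13.9)
light = pscp(78.6, params=1e6, latency=30.0, memory=11.9,
             params_max=40e6, latency_max=35.0, memory_max=13.9)
print(f"heavy: {heavy:.1f}, light: {light:.1f}")  # the lighter run can win
```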
Results & Findings
| PEFT method | Avg. task score (↑) | Avg. trainable params (%) | Inference slowdown | Peak training memory (GB) |
|---|---|---|---|---|
| LoRA | 84.2 | 0.5 % | +3 % | 12.1 |
| Adapter | 82.7 | 1.2 % | +5 % | 13.5 |
| IA³ | 81.9 | 0.8 % | +4 % | 12.8 |
| Prefix‑tuning | 80.4 | 1.0 % | +7 % | 13.9 |
| Prompt‑tuning | 78.6 | 0.3 % | +2 % | 11.9 |
| BitFit | 75.3 | 0.1 % | +1 % | 11.5 |
- Performance vs. efficiency: LoRA consistently achieves the highest PSCP because it pairs the strongest average task score with a small trainable‑parameter budget and minimal latency and memory overhead.
- Task variance: Prompt‑tuning shines on generation‑heavy tasks (e.g., summarization) where a tiny prompt can steer the model, while adapters are more robust on classification benchmarks.
- Scaling behavior: As model size grows, the relative memory savings of PEFT become more pronounced, making PEFT increasingly attractive for 30B‑plus models.
Practical Implications
- Faster iteration cycles: Developers can fine‑tune a 7B‑parameter LLM on a single GPU in under an hour using LoRA, cutting down experimentation time dramatically.
- Cost‑effective deployment: Since inference speed is barely impacted, production services can serve PEFT‑tuned models without needing extra hardware, translating to lower cloud bills and reduced carbon emissions.
- Modular updates: PEFT layers are lightweight files (often < 10 MB) that can be swapped or version‑controlled independently of the massive base model, simplifying A/B testing and continuous delivery pipelines (see the save/load sketch after this list).
- Edge‑friendly scenarios: For on‑device or low‑resource environments, prompt‑tuning or BitFit can enable personalization without storing the full fine‑tuned checkpoint.
- Benchmark as a service: The open‑source PEFT‑Bench can be integrated into CI/CD workflows to automatically evaluate new PEFT ideas against a standardized suite, ensuring fair comparisons before shipping to customers.
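The modular‑updates point is easy to see in code. Below is a minimal sketch using the Hugging Face `peft` library, with `gpt2` as a stand‑in base model and a hypothetical adapter directory:

```python
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Train time: attach a LoRA adapter to a frozen base model.
base = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(base, LoraConfig(task_type=TaskType.CAUSAL_LM, r=8))

# Persist only the adapter weights (typically a few MB), not the base model.
model.save_pretrained("adapters/my-task-lora")

# Deploy time: reload the untouched base and attach the versioned adapter.
base = AutoModelForCausalLM.from_pretrained("gpt2")
restored = PeftModel.from_pretrained(base, "adapters/my-task-lora")
```

Because the base checkpoint never changes, swapping adapters is a file‑level operation, which is what makes the A/B‑testing and CI/CD workflows above practical.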
Limitations & Future Work
- Fixed hyper‑parameters: To keep the comparison clean, the authors used a single learning‑rate schedule across all methods; task‑specific tuning could shift rankings.
- Model diversity: Only autoregressive LLMs were examined; encoder‑only or encoder‑decoder architectures (e.g., BERT, T5) may exhibit different PEFT dynamics.
- PSCP weighting: The penalty weights $\alpha, \beta, \gamma$ are currently set heuristically; exploring domain‑specific weightings (e.g., latency‑critical vs. memory‑critical use cases) is an open direction.
- Long‑context tasks: Benchmarks did not include very long‑context scenarios (e.g., retrieval‑augmented generation), where some PEFT methods might behave differently.
Future work could extend PEFT‑Bench to multimodal models, incorporate automated hyper‑parameter search for each PEFT variant, and provide a leaderboard that tracks community submissions.
If you’re looking to experiment with cheap yet powerful fine‑tuning for your own LLM projects, PEFT‑Bench offers a ready‑made playground. Clone the repo, pick your favorite PEFT method, and let the PSCP score guide you toward the most efficient solution for your workload.
Authors
- Robert Belanec
- Branislav Pecher
- Ivan Srba
- Maria Bielikova
Paper Information
- arXiv ID: 2511.21285v1
- Categories: cs.CL
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21285v1