[Paper] GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers
Source: arXiv - 2512.04296v1
Overview
The paper introduces GRASP (Grouped Activation Shared Parameterization), a new parameter‑efficient fine‑tuning (PEFT) technique for large transformer models. By grouping token activations and learning a tiny set of shared scaling/shift parameters per group, GRASP slashes the number of trainable weights while still capturing task‑specific nuances. A stochastic extension, StochGRASP, further models weight uncertainty, making the fine‑tuned model more resilient to hardware‑level noise—an attractive property for edge AI deployments.
Key Contributions
- Grouped modulation: Partitions each token representation into K ≪ D groups and learns a shared scale‑and‑shift vector per group, drastically reducing trainable parameters.
- StochGRASP: Extends GRASP with Gaussian perturbations on the shared parameters and a noise‑aware loss, enabling robustness to weight noise during inference.
- Parameter efficiency: Uses up to 10× fewer trainable parameters than popular PEFT methods such as LoRA and BitFit.
- Competitive performance: Matches or exceeds state‑of‑the‑art PEFT results on GLUE (RoBERTa‑base/large) and E2E NLG (GPT‑2 Medium).
- Robustness to hardware variability: Demonstrates consistent accuracy gains under simulated inference noise, positioning StochGRASP for low‑power, emerging AI chips.
Methodology
- Activation grouping – For a selected transformer layer, the D-dimensional hidden vector of each token is split into K contiguous groups (e.g., D = 768, K = 8 → groups of size 96).
- Shared scaling & shifting – Each group receives a single learnable scale vector γ_g and shift vector β_g. During fine‑tuning, the original hidden vector h is transformed as
\[ \tilde{h}_{i,g} = \gamma_g \odot h_{i,g} + \beta_g \]
where g indexes the group and i the token.
- Parameter count – Instead of updating the full weight matrices (millions of parameters), only the 2K group-level scale and shift vectors (2 × K × (D/K) scalars per modulated layer) are trained, yielding an order‑of‑magnitude reduction.
- StochGRASP – Replaces the deterministic γ, β with Gaussian distributions (mean + σ·ε). The loss incorporates the expected noise, encouraging the model to learn parameters that remain stable under random perturbations.
- Training – Standard downstream task loss (e.g., cross‑entropy) plus a regularizer that penalizes large variance in the stochastic parameters. Fine‑tuning proceeds exactly like any other PEFT method, requiring only a few epochs (a minimal sketch follows this list).
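The following PyTorch sketch illustrates the grouped modulation and its stochastic variant as described above. It is a minimal illustration under stated assumptions, not the authors' reference implementation: the module names (`GraspModulation`, `StochGraspModulation`), the log-σ parameterization, and the exact form of the variance penalty are illustrative choices.

```python
import torch
import torch.nn as nn

class GraspModulation(nn.Module):
    """Grouped scale-and-shift: split the D-dim hidden state into K contiguous
    groups and apply a learnable per-group scale/shift vector (sketch)."""
    def __init__(self, hidden_dim: int, num_groups: int):
        super().__init__()
        assert hidden_dim % num_groups == 0, "D must be divisible by K"
        self.num_groups = num_groups
        group_dim = hidden_dim // num_groups
        # 2 x K x (D/K) trainable scalars per modulated layer
        self.gamma = nn.Parameter(torch.ones(num_groups, group_dim))   # scales
        self.beta = nn.Parameter(torch.zeros(num_groups, group_dim))   # shifts

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, D) -> grouped view (batch, seq_len, K, D/K)
        b, t, d = h.shape
        h_grouped = h.reshape(b, t, self.num_groups, -1)
        h_mod = self.gamma * h_grouped + self.beta   # broadcast over batch/seq
        return h_mod.reshape(b, t, d)

class StochGraspModulation(GraspModulation):
    """StochGRASP-style variant: gamma/beta are sampled as mean + sigma * eps
    during training so the learned means tolerate weight perturbations."""
    def __init__(self, hidden_dim: int, num_groups: int, init_log_sigma: float = -4.0):
        super().__init__(hidden_dim, num_groups)
        self.log_sigma_gamma = nn.Parameter(torch.full_like(self.gamma, init_log_sigma))
        self.log_sigma_beta = nn.Parameter(torch.full_like(self.beta, init_log_sigma))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, t, d = h.shape
        h_grouped = h.reshape(b, t, self.num_groups, -1)
        if self.training:   # reparameterized sampling during fine-tuning
            gamma = self.gamma + self.log_sigma_gamma.exp() * torch.randn_like(self.gamma)
            beta = self.beta + self.log_sigma_beta.exp() * torch.randn_like(self.beta)
        else:               # use the means at inference
            gamma, beta = self.gamma, self.beta
        return (gamma * h_grouped + beta).reshape(b, t, d)

    def variance_penalty(self) -> torch.Tensor:
        # Regularizer discouraging large parameter variance (assumed form)
        return self.log_sigma_gamma.exp().pow(2).mean() + self.log_sigma_beta.exp().pow(2).mean()
```

A noise-aware training step would then combine the task loss with the penalty, e.g. `loss = cross_entropy(logits, labels) + lambda_reg * mod.variance_penalty()`, where `lambda_reg` is an assumed hyperparameter rather than a value taken from the paper.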
Results & Findings
| Method | Trainable params (% of total) | GLUE Avg. Score | E2E NLG BLEU (GPT‑2 Medium) |
|---|---|---|---|
| LoRA (baseline) | 0.5 % | 84.2 | 27.1 |
| BitFit | 0.2 % | 83.8 | 26.9 |
| GRASP | 0.05 % | 84.5 (↑0.3) | 27.3 (↑0.2) |
| StochGRASP | 0.07 % | 84.7 (↑0.5) | 27.6 (↑0.5) |

↑ values indicate gains over the LoRA baseline.
- Parameter reduction: GRASP uses ~10× fewer trainable weights than LoRA while delivering comparable or better accuracy.
- Noise robustness: When synthetic Gaussian noise (σ = 0.01–0.05) is injected into the model weights at inference time, StochGRASP's accuracy drops by less than 1 %, versus more than 3 % for deterministic baselines (a simulation sketch follows this list).
- Scalability: Experiments on both RoBERTa‑base (125 M) and RoBERTa‑large (355 M) confirm that the grouping strategy scales without needing to retune K per model size.
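The robustness protocol can be approximated in a few lines of PyTorch: perturb every weight with zero-mean Gaussian noise before evaluating. This is a sketch of such a simulation, not the paper's evaluation harness; whether σ is applied in absolute terms or relative to each tensor's scale is an assumption here, and `evaluate` is a placeholder for a task-specific metric loop.

```python
import copy
import torch

@torch.no_grad()
def with_weight_noise(model: torch.nn.Module, sigma: float, relative: bool = True):
    """Return a copy of `model` whose parameters are perturbed by zero-mean
    Gaussian noise of std sigma (optionally scaled per-tensor), simulating
    analog / low-precision hardware variability."""
    noisy = copy.deepcopy(model)
    for p in noisy.parameters():
        scale = p.detach().std() if (relative and p.numel() > 1) else 1.0
        p.add_(torch.randn_like(p) * sigma * scale)
    return noisy

# Sweep the noise levels reported above and compare accuracy drops.
# `evaluate` stands in for your task-specific evaluation loop.
# for sigma in (0.01, 0.03, 0.05):
#     acc = evaluate(with_weight_noise(model, sigma), val_loader)
#     print(f"sigma={sigma}: accuracy={acc:.3f}")
```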
Practical Implications
- Edge deployment: The tiny trainable footprint means fine‑tuned models can be stored and updated on devices with limited flash (e.g., microcontrollers, ASICs) while still benefiting from large pre‑trained backbones.
- Energy‑efficient inference: StochGRASP’s robustness to weight noise aligns with the stochastic nature of emerging low‑precision AI accelerators (e.g., analog in‑memory computing), reducing the need for costly error‑correction circuitry.
- Rapid iteration: Because only a handful of parameters change, developers can experiment with many downstream tasks on the same base model without re‑training the full network, shortening time‑to‑market.
- Compatibility: GRASP plugs into existing transformer libraries (Hugging Face, PyTorch) with minimal code changes: just specify the layers to modulate and the group count K (see the integration sketch below).
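As an illustration of that compatibility, the sketch below attaches the hypothetical `GraspModulation` module from the methodology sketch to one encoder layer of a Hugging Face RoBERTa model via a forward hook and trains only the GRASP parameters and the task head. The chosen layer, the hook-based attachment, and the learning rate are assumptions for illustration, not the paper's prescribed setup.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Freeze the backbone; only GRASP parameters and the task head will be trained.
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

hidden_dim, num_groups = model.config.hidden_size, 8      # D = 768, K = 8 -> groups of 96
grasp = GraspModulation(hidden_dim, num_groups)            # module from the sketch above

# Attach GRASP to one encoder layer's output via a forward hook; the exact
# module path (encoder.layer[-1].output) is an illustrative choice.
def apply_grasp(module, inputs, output):
    return grasp(output)

model.roberta.encoder.layer[-1].output.register_forward_hook(apply_grasp)

optimizer = torch.optim.AdamW(
    list(grasp.parameters()) + list(model.classifier.parameters()), lr=1e-3
)
```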
Limitations & Future Work
- Group granularity trade‑off: Very aggressive grouping (tiny K) may under‑fit complex tasks; the paper reports a modest sensitivity analysis but leaves automated group‑size selection for future research.
- Hardware validation: Robustness is demonstrated with simulated noise; real‑world tests on analog or low‑precision chips are needed to confirm the gains.
- Extension to vision transformers: The current study focuses on NLP models; applying GRASP to ViT or multimodal transformers could uncover new efficiency frontiers.
Overall, GRASP and its stochastic variant offer a compelling blend of parameter efficiency and hardware resilience, making them a practical tool for developers aiming to bring large transformer capabilities to resource‑constrained environments.
Authors
- Malyaban Bal
- Abhronil Sengupta
Paper Information
- arXiv ID: 2512.04296v1
- Categories: cs.LG, cs.NE
- Published: December 3, 2025
- PDF: https://arxiv.org/pdf/2512.04296v1