[Paper] GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers
Source: arXiv - 2512.04296v1
Overview
The paper introduces GRASP (Grouped Activation Shared Parameterization), a new parameter‑efficient fine‑tuning (PEFT) technique for large transformer models. By grouping token activations and learning a tiny set of shared scaling/shift parameters per group, GRASP slashes the number of trainable weights while still capturing task‑specific nuances. A stochastic extension, StochGRASP, further models weight uncertainty, making the fine‑tuned model more resilient to hardware‑level noise—an attractive property for edge AI deployments.
Key Contributions
- Grouped modulation: Partitions each token representation into K ≪ D groups and learns a shared scale‑and‑shift vector per group, drastically reducing trainable parameters.
- StochGRASP: Extends GRASP with Gaussian perturbations on the shared parameters and a noise‑aware loss, enabling robustness to weight noise during inference.
- Parameter efficiency: Uses up to 10× fewer trainable parameters than popular PEFT methods such as LoRA and BitFit.
- Competitive performance: Matches or exceeds state‑of‑the‑art PEFT results on GLUE (RoBERTa‑base/large) and E2E NLG (GPT‑2 Medium).
- Robustness to hardware variability: Demonstrates consistent accuracy gains under simulated inference noise, positioning StochGRASP for low‑power, emerging AI chips.
Methodology
- Activation grouping – For a selected transformer layer, the D-dimensional hidden vector of each token is split into K contiguous groups (e.g., D = 768, K = 8 → groups of size 96).
- Shared scaling & shifting – Each group receives a single learnable scale vector γ_g and shift vector β_g. During fine‑tuning, the original hidden vector h is transformed as
\[ \tilde{h}_{i,g} = \gamma_g \odot h_{i,g} + \beta_g \]
where g indexes the group and i the token.
- Parameter count – Instead of updating the full weight matrices (millions of parameters), only the 2K group-level scale and shift vectors (2 × K × (D/K) scalars per modulated layer) are trained, yielding an order‑of‑magnitude reduction.
- StochGRASP – Replaces the deterministic γ, β with Gaussian distributions (mean + σ·ε). The loss incorporates the expected noise, encouraging the model to learn parameters that remain stable under random perturbations.
- Training – Standard downstream task loss (e.g., cross‑entropy) plus a regularizer that penalizes large variance in the stochastic parameters. Fine‑tuning proceeds exactly like any other PEFT method, requiring only a few epochs (a minimal sketch follows this list).
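The following PyTorch sketch illustrates the grouped modulation and its stochastic variant as described above. It is a minimal illustration under stated assumptions, not the authors' reference implementation: the module names (`GraspModulation`, `StochGraspModulation`), the log-σ parameterization, and the exact form of the variance penalty are illustrative choices.

```python
import torch
import torch.nn as nn

class GraspModulation(nn.Module):
    """Grouped scale-and-shift: split the D-dim hidden state into K contiguous
    groups and apply a learnable per-group scale/shift vector (sketch)."""
    def __init__(self, hidden_dim: int, num_groups: int):
        super().__init__()
        assert hidden_dim % num_groups == 0, "D must be divisible by K"
        self.num_groups = num_groups
        group_dim = hidden_dim // num_groups
        # 2 x K x (D/K) trainable scalars per modulated layer
        self.gamma = nn.Parameter(torch.ones(num_groups, group_dim))   # scales
        self.beta = nn.Parameter(torch.zeros(num_groups, group_dim))   # shifts

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, D) -> grouped view (batch, seq_len, K, D/K)
        b, t, d = h.shape
        h_grouped = h.reshape(b, t, self.num_groups, -1)
        h_mod = self.gamma * h_grouped + self.beta   # broadcast over batch/seq
        return h_mod.reshape(b, t, d)

class StochGraspModulation(GraspModulation):
    """StochGRASP-style variant: gamma/beta are sampled as mean + sigma * eps
    during training so the learned means tolerate weight perturbations."""
    def __init__(self, hidden_dim: int, num_groups: int, init_log_sigma: float = -4.0):
        super().__init__(hidden_dim, num_groups)
        self.log_sigma_gamma = nn.Parameter(torch.full_like(self.gamma, init_log_sigma))
        self.log_sigma_beta = nn.Parameter(torch.full_like(self.beta, init_log_sigma))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, t, d = h.shape
        h_grouped = h.reshape(b, t, self.num_groups, -1)
        if self.training:   # reparameterized sampling during fine-tuning
            gamma = self.gamma + self.log_sigma_gamma.exp() * torch.randn_like(self.gamma)
            beta = self.beta + self.log_sigma_beta.exp() * torch.randn_like(self.beta)
        else:               # use the means at inference
            gamma, beta = self.gamma, self.beta
        return (gamma * h_grouped + beta).reshape(b, t, d)

    def variance_penalty(self) -> torch.Tensor:
        # Regularizer discouraging large parameter variance (assumed form)
        return self.log_sigma_gamma.exp().pow(2).mean() + self.log_sigma_beta.exp().pow(2).mean()
```

A noise-aware training step would then combine the task loss with the penalty, e.g. `loss = cross_entropy(logits, labels) + lambda_reg * mod.variance_penalty()`, where `lambda_reg` is an assumed hyperparameter rather than a value taken from the paper.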
Results & Findings
| Method | Trainable params (% of total) | GLUE Avg. Score | E2E NLG BLEU (GPT‑2 Medium) |
|---|---|---|---|
| LoRA (baseline) | 0.5 % | 84.2 | 27.1 |
| BitFit | 0.2 % | 83.8 | 26.9 |
| GRASP | 0.05 % | 84.5 (↑0.3) | 27.3 (↑0.2) |
| StochGRASP | 0.07 % | 84.7 (↑0.5) | 27.6 (↑0.5) |

↑ values indicate gains over the LoRA baseline.
- Parameter reduction: GRASP uses ~10× fewer trainable weights than LoRA while delivering comparable or better accuracy.
- Noise robustness: When synthetic Gaussian noise (σ = 0.01–0.05) is injected into the model weights at inference time, StochGRASP's accuracy drops by less than 1 %, versus more than 3 % for deterministic baselines (a simulation sketch follows this list).
- Scalability: Experiments on both RoBERTa‑base (125 M) and RoBERTa‑large (355 M) confirm that the grouping strategy scales without needing to retune K per model size.
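The robustness protocol can be approximated in a few lines of PyTorch: perturb every weight with zero-mean Gaussian noise before evaluating. This is a sketch of such a simulation, not the paper's evaluation harness; whether σ is applied in absolute terms or relative to each tensor's scale is an assumption here, and `evaluate` is a placeholder for a task-specific metric loop.

```python
import copy
import torch

@torch.no_grad()
def with_weight_noise(model: torch.nn.Module, sigma: float, relative: bool = True):
    """Return a copy of `model` whose parameters are perturbed by zero-mean
    Gaussian noise of std sigma (optionally scaled per-tensor), simulating
    analog / low-precision hardware variability."""
    noisy = copy.deepcopy(model)
    for p in noisy.parameters():
        scale = p.detach().std() if (relative and p.numel() > 1) else 1.0
        p.add_(torch.randn_like(p) * sigma * scale)
    return noisy

# Sweep the noise levels reported above and compare accuracy drops.
# `evaluate` stands in for your task-specific evaluation loop.
# for sigma in (0.01, 0.03, 0.05):
#     acc = evaluate(with_weight_noise(model, sigma), val_loader)
#     print(f"sigma={sigma}: accuracy={acc:.3f}")
```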
Practical Implications
- Edge deployment: The tiny trainable footprint means fine‑tuned models can be stored and updated on devices with limited flash (e.g., microcontrollers, ASICs) while still benefiting from large pre‑trained backbones.
- Energy‑efficient inference: StochGRASP’s robustness to weight noise aligns with the stochastic nature of emerging low‑precision AI accelerators (e.g., analog in‑memory computing), reducing the need for costly error‑correction circuitry.
- Rapid iteration: Because only a handful of parameters change, developers can experiment with many downstream tasks on the same base model without re‑training the full network, shortening time‑to‑market.
- Compatibility: GRASP plugs into existing transformer libraries (Hugging Face, PyTorch) with minimal code changes: just specify the layers to modulate and the group count K (see the integration sketch below).
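As an illustration of that compatibility, the sketch below attaches the hypothetical `GraspModulation` module from the methodology sketch to one encoder layer of a Hugging Face RoBERTa model via a forward hook and trains only the GRASP parameters and the task head. The chosen layer, the hook-based attachment, and the learning rate are assumptions for illustration, not the paper's prescribed setup.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Freeze the backbone; only GRASP parameters and the task head will be trained.
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

hidden_dim, num_groups = model.config.hidden_size, 8      # D = 768, K = 8 -> groups of 96
grasp = GraspModulation(hidden_dim, num_groups)            # module from the sketch above

# Attach GRASP to one encoder layer's output via a forward hook; the exact
# module path (encoder.layer[-1].output) is an illustrative choice.
def apply_grasp(module, inputs, output):
    return grasp(output)

model.roberta.encoder.layer[-1].output.register_forward_hook(apply_grasp)

optimizer = torch.optim.AdamW(
    list(grasp.parameters()) + list(model.classifier.parameters()), lr=1e-3
)
```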
Limitations & Future Work
- Group granularity trade‑off: Very aggressive grouping (tiny K) may under‑fit complex tasks; the paper reports a modest sensitivity analysis but leaves automated group‑size selection for future research.
- Hardware validation: Robustness is demonstrated with simulated noise; real‑world tests on analog or low‑precision chips are needed to confirm the gains.
- Extension to vision transformers: The current study focuses on NLP models; applying GRASP to ViT or multimodal transformers could uncover new efficiency frontiers.
Overall, GRASP and its stochastic variant offer a compelling blend of parameter efficiency and hardware resilience, making them a practical tool for developers aiming to bring large transformer capabilities to resource‑constrained environments.
Authors
- Malyaban Bal
- Abhronil Sengupta
Paper Information
- arXiv ID: 2512.04296v1
- Categories: cs.LG, cs.NE
- Published: December 3, 2025
- PDF: https://arxiv.org/pdf/2512.04296v1