[Paper] GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers

Published: December 3, 2025 at 05:17 PM EST
4 min read
Source: arXiv - 2512.04296v1

Overview

The paper introduces GRASP (Grouped Activation Shared Parameterization), a new parameter‑efficient fine‑tuning (PEFT) technique for large transformer models. By grouping token activations and learning a tiny set of shared scaling/shift parameters per group, GRASP slashes the number of trainable weights while still capturing task‑specific nuances. A stochastic extension, StochGRASP, further models weight uncertainty, making the fine‑tuned model more resilient to hardware‑level noise—an attractive property for edge AI deployments.

Key Contributions

  • Grouped modulation: Partitions each token representation into K ≪ D groups and learns a shared scale‑and‑shift vector per group, drastically reducing trainable parameters.
  • StochGRASP: Extends GRASP with Gaussian perturbations on the shared parameters and a noise‑aware loss, enabling robustness to weight noise during inference.
  • Parameter efficiency: Uses up to 10× fewer trainable parameters than popular PEFT methods such as LoRA and BitFit.
  • Competitive performance: Matches or exceeds state‑of‑the‑art PEFT results on GLUE (RoBERTa‑base/large) and E2E NLG (GPT‑2 Medium).
  • Robustness to hardware variability: Demonstrates consistent accuracy gains under simulated inference noise, positioning StochGRASP for low‑power, emerging AI chips.

Methodology

  1. Activation grouping – For a selected transformer layer, the D-dimensional hidden vector of each token is split into K contiguous groups (e.g., D = 768, K = 8 → groups of size 96).

  2. Shared scaling & shifting – Each group g receives a single learnable scale vector γ_g and shift vector β_g. During fine‑tuning, the group slice h_{i,g} of token i's hidden vector is transformed as (a minimal sketch follows this list):

    \[ \tilde{h}_{i,g} = \gamma_g \odot h_{i,g} + \beta_g \]

    where g indexes the group and i the token.

  3. Parameter count – Instead of updating the full weight matrices (millions of parameters), only 2 × K vectors of length D/K are trained (2 × K × (D/K) = 2D scalars per modulated layer), yielding an order‑of‑magnitude reduction.

  4. StochGRASP – Replaces deterministic γ, β with Gaussian distributions (mean + σ·ε). The loss incorporates the expected noise, encouraging the model to learn parameters that are stable under random perturbations.

  5. Training – Standard downstream task loss (e.g., cross‑entropy) plus a regularizer that penalizes large variance in the stochastic parameters. Fine‑tuning proceeds exactly like any other PEFT method, requiring only a few epochs.
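
To make the mechanics concrete, here is a minimal PyTorch sketch of the grouped modulation and its stochastic variant as described in steps 1–4 above. The module name, default hyperparameters, the log‑σ parameterization, and the variance penalty form are illustrative assumptions; this is not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class GRASPModulation(nn.Module):
    """Grouped scale-and-shift over a D-dimensional hidden state split into K groups.

    Illustrative sketch of the description above, not the paper's reference code.
    """

    def __init__(self, hidden_dim: int = 768, num_groups: int = 8, stochastic: bool = False):
        super().__init__()
        assert hidden_dim % num_groups == 0, "D must be divisible by K"
        self.num_groups = num_groups
        self.group_size = hidden_dim // num_groups       # e.g. 768 / 8 = 96
        self.stochastic = stochastic
        # One scale and one shift vector per group: 2 * K * (D/K) trainable values.
        self.gamma = nn.Parameter(torch.ones(num_groups, self.group_size))
        self.beta = nn.Parameter(torch.zeros(num_groups, self.group_size))
        if stochastic:
            # StochGRASP (assumed parameterization): a learnable log-std for the
            # Gaussian perturbation applied to the shared parameters.
            self.log_sigma = nn.Parameter(torch.full((num_groups, self.group_size), -4.0))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, D) -> regroup to (batch, seq_len, K, D/K)
        b, t, d = h.shape
        h_grouped = h.reshape(b, t, self.num_groups, self.group_size)
        gamma, beta = self.gamma, self.beta
        if self.stochastic and self.training:
            # Reparameterized noise: theta + sigma * eps, so gradients reach sigma.
            sigma = self.log_sigma.exp()
            gamma = gamma + sigma * torch.randn_like(sigma)
            beta = beta + sigma * torch.randn_like(sigma)
        # h~_{i,g} = gamma_g ⊙ h_{i,g} + beta_g, broadcast over batch and tokens
        return (gamma * h_grouped + beta).reshape(b, t, d)

    def variance_penalty(self) -> torch.Tensor:
        # Simple regularizer discouraging large perturbation variance (step 5);
        # the exact regularizer used in the paper may differ.
        if not self.stochastic:
            return torch.zeros((), device=self.gamma.device)
        return self.log_sigma.exp().pow(2).mean()
```

With D = 768 and K = 8 this amounts to 2 × 8 × 96 = 1,536 trainable values per modulated layer, while the frozen backbone contributes none. Following step 5, the training objective would be the standard task loss plus a small multiple of `variance_penalty()`.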

Results & Findings

Method             Trainable params   GLUE Avg. Score   E2E NLG BLEU (GPT‑2 Medium)
LoRA (baseline)    0.5 % of total     84.2              27.1
BitFit             0.2 % of total     83.8              26.9
GRASP              0.05 % of total    84.5 (↑0.3)       27.3 (↑0.2)
StochGRASP         0.07 % of total    84.7 (↑0.5)       27.6 (↑0.5)

  • Parameter reduction: GRASP uses ~10× fewer trainable weights than LoRA while delivering comparable or better accuracy.
  • Noise robustness: When synthetic Gaussian noise (σ = 0.01–0.05) is injected into the model weights at inference time, StochGRASP’s accuracy drops <1 % versus >3 % for deterministic baselines.
  • Scalability: Experiments on both RoBERTa‑base (125 M) and RoBERTa‑large (355 M) confirm that the grouping strategy scales without needing to retune K per model size.
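
The noise-robustness experiment can be approximated with a simple weight-perturbation loop like the one below; the evaluation callback, σ grid, and additive-noise model are assumptions inferred from this summary, not the paper's exact protocol.

```python
import copy
import torch

@torch.no_grad()
def score_under_weight_noise(model, eval_fn, sigmas=(0.01, 0.03, 0.05)):
    """Evaluate a model after adding N(0, sigma^2) noise to every weight.

    `eval_fn(model) -> float` is assumed to run the task's standard evaluation
    (e.g., GLUE accuracy) and return a scalar score.
    """
    scores = {}
    for sigma in sigmas:
        noisy = copy.deepcopy(model)              # keep the clean model untouched
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * sigma)   # simulated hardware weight noise
        scores[sigma] = eval_fn(noisy)
    return scores
```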

Practical Implications

  • Edge deployment: The tiny trainable footprint means fine‑tuned models can be stored and updated on devices with limited flash (e.g., microcontrollers, ASICs) while still benefiting from large pre‑trained backbones.
  • Energy‑efficient inference: StochGRASP’s robustness to weight noise aligns with the stochastic nature of emerging low‑precision AI accelerators (e.g., analog in‑memory computing), reducing the need for costly error‑correction circuitry.
  • Rapid iteration: Because only a handful of parameters change, developers can experiment with many downstream tasks on the same base model without re‑training the full network, shortening time‑to‑market.
  • Compatibility: GRASP plugs into existing transformer libraries (Hugging Face, PyTorch) with minimal code changes—just specify the layers to group and the group count.
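
As a rough illustration of that integration path, the snippet below freezes a Hugging Face RoBERTa backbone and attaches the `GRASPModulation` sketch from the Methodology section to one encoder layer's output via a standard forward hook. The choice of layer, the hook-based wiring, and the module itself are assumptions for illustration; the paper does not prescribe this exact API.

```python
from transformers import AutoModelForSequenceClassification

# GRASPModulation is the illustrative module defined in the earlier sketch.
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Freeze the pre-trained backbone; keep the task head trainable.
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

# One grouped-modulation module on the last encoder layer (layer choice is illustrative).
grasp = GRASPModulation(hidden_dim=model.config.hidden_size, num_groups=8)

def _apply_grasp(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return grasp(output)

model.roberta.encoder.layer[-1].output.register_forward_hook(_apply_grasp)

n_trainable = sum(p.numel() for p in list(grasp.parameters()) + list(model.classifier.parameters()))
print(f"{n_trainable} trainable parameters")
```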

Limitations & Future Work

  • Group granularity trade‑off: Very aggressive grouping (tiny K) may under‑fit complex tasks; the paper reports a modest sensitivity analysis but leaves automated group‑size selection for future research.
  • Hardware validation: Robustness is demonstrated with simulated noise; real‑world tests on analog or low‑precision chips are needed to confirm the gains.
  • Extension to vision transformers: The current study focuses on NLP models; applying GRASP to ViT or multimodal transformers could uncover new efficiency frontiers.

Overall, GRASP and its stochastic variant offer a compelling blend of parameter efficiency and hardware resilience, making them a practical tool for developers aiming to bring large transformer capabilities to resource‑constrained environments.

Authors

  • Malyaban Bal
  • Abhronil Sengupta

Paper Information

  • arXiv ID: 2512.04296v1
  • Categories: cs.LG, cs.NE
  • Published: December 3, 2025