[Paper] GiVA: Gradient-Informed Bases for Vector-Based Adaptation
Source: arXiv - 2604.21901v1
Overview
The paper GiVA: Gradient‑Informed Bases for Vector‑Based Adaptation tackles a growing pain point in modern deep learning: fine‑tuning massive models without blowing up memory or compute budgets. While LoRA (Low‑Rank Adaptation) has become the de facto standard for parameter‑efficient fine‑tuning, newer vector‑based adapters promise even smaller storage footprints—at the cost of needing much higher ranks (i.e., more parameters) to hit LoRA‑level performance. GiVA introduces a clever gradient‑driven initialization that lets vector adapters achieve LoRA‑grade results with ≈8× fewer parameters, while keeping training speed on par with LoRA.
Key Contributions
- Gradient‑Informed Basis (GiVA): a systematic way to seed vector adapters using the direction of the loss gradient, dramatically improving their expressive power from the start.
- Rank Reduction: Demonstrates that vector adapters can work with ranks up to eight times smaller than those required by prior vector‑based methods, while matching or surpassing their accuracy.
- Broad Empirical Validation: Benchmarks span NLP (GLUE, SQuAD, summarization), generative tasks (GPT‑2 finetuning), and vision (ImageNet classification), showing consistent gains across modalities.
- Training Efficiency: Keeps per‑step compute and wall‑clock time comparable to LoRA, avoiding the slowdown that typically plagues high‑rank vector adapters.
- Open‑Source Toolkit: Authors release a lightweight PyTorch library that plugs into existing LoRA‑style pipelines with minimal code changes.
Methodology
- Vector‑Based Adaptation Recap
- Instead of learning a low‑rank matrix ΔW = A Bᵀ (as in LoRA), vector adapters store a set of basis vectors v₁ … vₖ and learn scalar coefficients α per downstream task. The effective weight change is a linear combination of these vectors.
- Problem with Random Init
- Randomly initialized vectors are, with high probability, nearly orthogonal to the useful descent directions of the loss landscape, forcing the optimizer to "discover" those directions itself, which requires a large k (rank).
- Gradient‑Informed Initialization
- GiVA computes the gradient of the loss w.r.t. the frozen pretrained weights on a small proxy batch.
- It then performs a truncated SVD on this gradient matrix, extracting the top‑k singular vectors. These vectors become the initial basis v₁ … vₖ.
- Because the basis already aligns with the steepest descent directions, the adapter can achieve high performance with far fewer vectors.
- Training Loop
- The pretrained backbone stays frozen. Only the scalar coefficients α (and optionally a tiny bias) are updated during fine‑tuning.
- Standard AdamW (or any optimizer) can be used; no extra hyper‑parameter tuning beyond LoRA’s learning rate is needed.
The whole pipeline is a drop‑in replacement for LoRA: swap the LoRA module with GiVAAdapter(rank=k) and you’re ready to go.
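The initialization and update rule described above can be sketched in a few lines of numpy. This is an illustrative stand‑in, not the authors' released code: the names `giva_init` and `delta_w` are hypothetical, the gradient matrix is simulated rather than computed by autograd, and the basis is the top‑k singular directions of that gradient, as the methodology describes.

```python
import numpy as np

def giva_init(grad_matrix: np.ndarray, k: int):
    """Build a rank-k basis from the top-k singular directions of the
    loss gradient w.r.t. a frozen pretrained weight matrix.
    (Hypothetical helper illustrating the paper's initialization.)"""
    U, S, Vt = np.linalg.svd(grad_matrix, full_matrices=False)
    # Each basis element is a rank-1 outer product u_i v_i^T.
    # The basis stays frozen; only the scalar coefficients are trained.
    return [np.outer(U[:, i], Vt[i, :]) for i in range(k)]

def delta_w(basis, alpha):
    """Effective weight change: a linear combination of the frozen basis."""
    return sum(a * b for a, b in zip(alpha, basis))

# Toy usage: pretend G is the gradient computed on a small proxy batch.
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))      # gradient w.r.t. a 64x32 weight
basis = giva_init(G, k=8)
alpha = np.full(8, 0.1)                # trainable scalars, one per vector
dW = delta_w(basis, alpha)             # applied as W_frozen + dW
```

Because the singular vectors are mutually orthogonal, the resulting update is at most rank k regardless of how the coefficients move during training.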
Results & Findings
| Task | Baseline (Full FT) | LoRA (rank = 8) | Vector‑Adapter (random, rank = 64) | GiVA (rank = 8) |
|---|---|---|---|---|
| GLUE‑MNLI | 84.5% | 84.2% | 81.0% | 84.0% |
| SQuAD‑v2 F1 | 88.3 | 88.0 | 84.5 | 87.9 |
| GPT‑2 Summarization (ROUGE‑L) | 31.2 | 30.9 | 28.4 | 30.7 |
| ImageNet (Top‑1) | 78.5% | 78.1% | 75.3% | 77.9% |
- Parameter Savings: Because only scalar coefficients are trained per task, GiVA's per‑task trainable footprint is roughly 1% of LoRA's at comparable accuracy, and the 8× rank reduction over randomly initialized vector adapters compounds the savings.
- Training Time: Wall‑clock per epoch is within 5 % of LoRA, far better than the 2–3× slowdown observed with high‑rank vector adapters.
- Stability: Across random seeds, GiVA’s variance is lower than both LoRA and random vector adapters, indicating a more robust initialization.
Practical Implications
- Edge & Mobile Deployments: The tiny adapter footprint (often < 0.1 % of the base model size) makes it feasible to ship a single large foundation model with multiple task‑specific adapters on devices with strict storage limits.
- Rapid Prototyping: Developers can spin up new fine‑tuned variants in minutes without worrying about GPU memory spikes, because the backbone stays frozen and the adapter is minuscule.
- Multi‑Task Serving: A single server can host dozens of GiVA adapters for different customers or languages, swapping only the scalar coefficient tensors at inference time.
- Cost‑Effective MLOps: Lower rank means fewer parameters to checkpoint, version, and transfer, reducing storage and network overhead in CI/CD pipelines.
- Compatibility: GiVA works with any transformer‑style model (BERT, T5, LLaMA, ViT, etc.) and integrates with popular libraries (🤗 Transformers, PEFT).
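The multi‑task serving pattern above amounts to keeping one frozen basis in memory and swapping tiny coefficient vectors per request. A minimal sketch, with all names and shapes assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# One shared, frozen stack of k rank-1 basis matrices (k=8, for a 64x32 weight).
basis = rng.standard_normal((8, 64, 32))

# Per-task state is just k scalars -- cheap to store, version, and hot-swap.
coeffs = {
    "task-a": rng.standard_normal(8),
    "task-b": rng.standard_normal(8),
}

def adapted_delta(task: str) -> np.ndarray:
    """Recombine the shared basis with one task's scalar coefficients."""
    # Contract the coefficient vector against the leading basis axis.
    return np.tensordot(coeffs[task], basis, axes=1)   # shape (64, 32)

dw_a = adapted_delta("task-a")
dw_b = adapted_delta("task-b")
```

Switching tasks touches only an 8‑element vector, which is why dozens of adapters can share one hosted backbone.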
Limitations & Future Work
- Gradient Proxy Quality: GiVA relies on a representative batch to compute the initial gradient. If the proxy data is biased, the basis may miss important directions, leading to sub‑optimal performance.
- Static Basis: Once initialized, the basis vectors are frozen. The authors note that allowing a small amount of basis fine‑tuning could further close the gap to full fine‑tuning on very niche tasks.
- Scalability of SVD: Computing a truncated SVD on the full gradient matrix can be memory‑intensive for extremely large models (e.g., > 10 B parameters). Future work could explore randomized SVD or low‑rank approximation tricks.
- Beyond Transformers: Experiments focus on transformer‑based NLP and vision models; applying GiVA to diffusion models or graph neural networks remains an open question.
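The randomized SVD mentioned as a remedy for the SVD cost is a standard technique (Halko et al.), not code from the paper. A numpy sketch of the idea, which approximates the top‑k singular triplets of a large gradient matrix by first sketching its column space with a thin random matrix:

```python
import numpy as np

def randomized_topk_svd(G: np.ndarray, k: int, oversample: int = 10, seed: int = 0):
    """Approximate top-k SVD via a Gaussian range sketch (Halko et al.)."""
    rng = np.random.default_rng(seed)
    # Thin test matrix: only k + oversample columns, not the full width of G.
    omega = rng.standard_normal((G.shape[1], k + oversample))
    Q, _ = np.linalg.qr(G @ omega)            # orthonormal range approximation
    # SVD of the small projected matrix recovers the leading directions.
    U_small, S, Vt = np.linalg.svd(Q.T @ G, full_matrices=False)
    return (Q @ U_small)[:, :k], S[:k], Vt[:k, :]

# Toy check against the exact SVD on a matrix of rank 8.
rng = np.random.default_rng(0)
G = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 100))
U, S, Vt = randomized_topk_svd(G, k=8)
S_exact = np.linalg.svd(G, compute_uv=False)[:8]
```

When the gradient has rapidly decaying spectrum, as gradients of overparameterized models often do, the sketch captures the leading directions at a fraction of the memory of a full SVD.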
Authors
- Neeraj Gangwar
- Rishabh Deshmukh
- Michael Shavlovsky
- Hancao Li
- Vivek Mittal
- Lexing Ying
- Nickvash Kani
Paper Information
- arXiv ID: 2604.21901v1
- Categories: cs.CL, cs.AI
- Published: April 23, 2026