[Paper] GiVA: Gradient-Informed Bases for Vector-Based Adaptation
Source: arXiv - 2604.21901v1
Overview
The paper GiVA: Gradient‑Informed Bases for Vector‑Based Adaptation tackles a growing pain point in modern deep learning: fine‑tuning massive models without blowing up memory or compute budgets. While LoRA (Low‑Rank Adaptation) has become the de facto standard for parameter‑efficient fine‑tuning, newer vector‑based adapters promise even smaller storage footprints—at the cost of needing much higher ranks (i.e., more parameters) to hit LoRA‑level performance. GiVA introduces a clever gradient‑driven initialization that lets vector adapters achieve LoRA‑grade results with ≈8× fewer parameters, while keeping training speed on par with LoRA.
Key Contributions
- Gradient‑Informed Basis (GiVA): a systematic way to seed vector adapters using the direction of the loss gradient, dramatically improving their expressive power from the start.
- Rank Reduction: Demonstrates that vector adapters can work with ranks up to eight times smaller than those required by prior vector‑based methods, while matching or surpassing their accuracy.
- Broad Empirical Validation: Benchmarks span NLP (GLUE, SQuAD, summarization), generative tasks (GPT‑2 finetuning), and vision (ImageNet classification), showing consistent gains across modalities.
- Training Efficiency: Keeps per‑step compute and wall‑clock time comparable to LoRA, avoiding the slowdown that typically plagues high‑rank vector adapters.
- Open‑Source Toolkit: Authors release a lightweight PyTorch library that plugs into existing LoRA‑style pipelines with minimal code changes.
Methodology
- Vector‑Based Adaptation Recap
- Instead of learning a low‑rank matrix ΔW = A Bᵀ (as in LoRA), vector adapters store a set of basis vectors v₁ … vₖ and learn scalar coefficients α per downstream task. The effective weight change is a linear combination of these vectors.
- Problem with Random Init
- Randomly initialized vectors are, with high probability, nearly orthogonal to the useful descent directions of the loss landscape, forcing the optimizer to "discover" those directions itself, which requires a large k (rank).
- Gradient‑Informed Initialization
- GiVA computes the gradient of the loss w.r.t. the frozen pretrained weights on a small proxy batch.
- It then performs a truncated SVD on this gradient matrix, extracting the top‑k singular vectors. These vectors become the initial basis v₁ … vₖ.
- Because the basis already aligns with the steepest descent directions, the adapter can achieve high performance with far fewer vectors.
- Training Loop
- The pretrained backbone stays frozen. Only the scalar coefficients α (and optionally a tiny bias) are updated during fine‑tuning.
- Standard AdamW (or any optimizer) can be used; no extra hyper‑parameter tuning beyond LoRA’s learning rate is needed.
The whole pipeline is a drop‑in replacement for LoRA: swap the LoRA module with GiVAAdapter(rank=k) and you’re ready to go.
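The initialization and update rule described above can be sketched in a few lines of numpy. This is an illustrative stand‑in, not the authors' released code: the names `giva_init` and `delta_w` are hypothetical, the gradient matrix is simulated rather than computed by autograd, and the basis is the top‑k singular directions of that gradient, as the methodology describes.

```python
import numpy as np

def giva_init(grad_matrix: np.ndarray, k: int):
    """Build a rank-k basis from the top-k singular directions of the
    loss gradient w.r.t. a frozen pretrained weight matrix.
    (Hypothetical helper illustrating the paper's initialization.)"""
    U, S, Vt = np.linalg.svd(grad_matrix, full_matrices=False)
    # Each basis element is a rank-1 outer product u_i v_i^T.
    # The basis stays frozen; only the scalar coefficients are trained.
    return [np.outer(U[:, i], Vt[i, :]) for i in range(k)]

def delta_w(basis, alpha):
    """Effective weight change: a linear combination of the frozen basis."""
    return sum(a * b for a, b in zip(alpha, basis))

# Toy usage: pretend G is the gradient computed on a small proxy batch.
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))      # gradient w.r.t. a 64x32 weight
basis = giva_init(G, k=8)
alpha = np.full(8, 0.1)                # trainable scalars, one per vector
dW = delta_w(basis, alpha)             # applied as W_frozen + dW
```

Because the singular vectors are mutually orthogonal, the resulting update is at most rank k regardless of how the coefficients move during training.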
Results & Findings
| Task | Baseline (Full FT) | LoRA (rank = 8) | Vector‑Adapter (random, rank = 64) | GiVA (rank = 8) |
|---|---|---|---|---|
| GLUE‑MNLI | 84.5% | 84.2% | 81.0% | 84.0% |
| SQuAD‑v2 F1 | 88.3 | 88.0 | 84.5 | 87.9 |
| GPT‑2 Summarization (ROUGE‑L) | 31.2 | 30.9 | 28.4 | 30.7 |
| ImageNet (Top‑1) | 78.5% | 78.1% | 75.3% | 77.9% |
- Parameter Savings: Because only scalar coefficients are trained per task, GiVA's per‑task trainable footprint is roughly 1% of LoRA's at comparable accuracy, and the 8× rank reduction over randomly initialized vector adapters compounds the savings.
- Training Time: Wall‑clock per epoch is within 5 % of LoRA, far better than the 2–3× slowdown observed with high‑rank vector adapters.
- Stability: Across random seeds, GiVA’s variance is lower than both LoRA and random vector adapters, indicating a more robust initialization.
Practical Implications
- Edge & Mobile Deployments: The tiny adapter footprint (often < 0.1 % of the base model size) makes it feasible to ship a single large foundation model with multiple task‑specific adapters on devices with strict storage limits.
- Rapid Prototyping: Developers can spin up new fine‑tuned variants in minutes without worrying about GPU memory spikes, because the backbone stays frozen and the adapter is minuscule.
- Multi‑Task Serving: A single server can host dozens of GiVA adapters for different customers or languages, swapping only the scalar coefficient tensors at inference time.
- Cost‑Effective MLOps: Lower rank means fewer parameters to checkpoint, version, and transfer, reducing storage and network overhead in CI/CD pipelines.
- Compatibility: GiVA works with any transformer‑style model (BERT, T5, LLaMA, ViT, etc.) and integrates with popular libraries (🤗 Transformers, PEFT).
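The multi‑task serving pattern above amounts to keeping one frozen basis in memory and swapping tiny coefficient vectors per request. A minimal sketch, with all names and shapes assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# One shared, frozen stack of k rank-1 basis matrices (k=8, for a 64x32 weight).
basis = rng.standard_normal((8, 64, 32))

# Per-task state is just k scalars -- cheap to store, version, and hot-swap.
coeffs = {
    "task-a": rng.standard_normal(8),
    "task-b": rng.standard_normal(8),
}

def adapted_delta(task: str) -> np.ndarray:
    """Recombine the shared basis with one task's scalar coefficients."""
    # Contract the coefficient vector against the leading basis axis.
    return np.tensordot(coeffs[task], basis, axes=1)   # shape (64, 32)

dw_a = adapted_delta("task-a")
dw_b = adapted_delta("task-b")
```

Switching tasks touches only an 8‑element vector, which is why dozens of adapters can share one hosted backbone.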
Limitations & Future Work
- Gradient Proxy Quality: GiVA relies on a representative batch to compute the initial gradient. If the proxy data is biased, the basis may miss important directions, leading to sub‑optimal performance.
- Static Basis: Once initialized, the basis vectors are frozen. The authors note that allowing a small amount of basis fine‑tuning could further close the gap to full fine‑tuning on very niche tasks.
- Scalability of SVD: Computing a truncated SVD on the full gradient matrix can be memory‑intensive for extremely large models (e.g., > 10 B parameters). Future work could explore randomized SVD or low‑rank approximation tricks.
- Beyond Transformers: Experiments focus on transformer‑based NLP and vision models; applying GiVA to diffusion models or graph neural networks remains an open question.
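The randomized SVD mentioned as a remedy for the SVD cost is a standard technique (Halko et al.), not code from the paper. A numpy sketch of the idea, which approximates the top‑k singular triplets of a large gradient matrix by first sketching its column space with a thin random matrix:

```python
import numpy as np

def randomized_topk_svd(G: np.ndarray, k: int, oversample: int = 10, seed: int = 0):
    """Approximate top-k SVD via a Gaussian range sketch (Halko et al.)."""
    rng = np.random.default_rng(seed)
    # Thin test matrix: only k + oversample columns, not the full width of G.
    omega = rng.standard_normal((G.shape[1], k + oversample))
    Q, _ = np.linalg.qr(G @ omega)            # orthonormal range approximation
    # SVD of the small projected matrix recovers the leading directions.
    U_small, S, Vt = np.linalg.svd(Q.T @ G, full_matrices=False)
    return (Q @ U_small)[:, :k], S[:k], Vt[:k, :]

# Toy check against the exact SVD on a matrix of rank 8.
rng = np.random.default_rng(0)
G = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 100))
U, S, Vt = randomized_topk_svd(G, k=8)
S_exact = np.linalg.svd(G, compute_uv=False)[:8]
```

When the gradient has rapidly decaying spectrum, as gradients of overparameterized models often do, the sketch captures the leading directions at a fraction of the memory of a full SVD.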
Authors
- Neeraj Gangwar
- Rishabh Deshmukh
- Michael Shavlovsky
- Hancao Li
- Vivek Mittal
- Lexing Ying
- Nickvash Kani
Paper Information
- arXiv ID: 2604.21901v1
- Categories: cs.CL, cs.AI
- Published: April 23, 2026