[Paper] LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge

Published: February 8, 2026
Source: arXiv - 2602.07849v1

Overview

Deploying large Vision‑Language Models (VLMs) on smartphones, wearables, or IoT gateways has been a persistent headache: the models are memory‑hungry, compute‑intensive, and their accuracy drops when the input data distribution shifts (e.g., different lighting, camera quality, or domain). The paper “LQA: A Lightweight Quantized‑Adaptive Framework for Vision‑Language Models on the Edge” proposes a practical solution that lets VLMs run efficiently on edge hardware while automatically adapting to new data without heavy gradients or cloud connectivity.

Key Contributions

  • Selective Hybrid Quantization (SHQ): A modality‑aware quantization scheme that applies different bit‑widths to visual and textual components, preserving critical information while cutting memory usage.
  • Gradient‑free Test‑time Adaptation (TTA): An adaptation loop that updates only a tiny set of lightweight parameters using a closed‑form, gradient‑free optimizer, making it feasible on devices with limited RAM/CPU.
  • End‑to‑end edge‑ready pipeline: Combines SHQ and the gradient‑free TTA into a single framework (LQA) that can be plugged into existing VLMs with minimal code changes.
  • Comprehensive evaluation: Demonstrates consistent gains across 7 public datasets covering synthetic corruptions (e.g., noise, blur) and real‑world domain shifts (e.g., night‑time scenes, medical imaging).
  • Resource savings: Achieves up to 19.9× lower memory footprint compared with full‑precision, gradient‑based TTA methods, while improving adaptation accuracy by ~4.5% on average.

Methodology

  1. Modality‑aware Quantization

    • Visual branch: Quantized to 4‑bit for convolutional feature extractors, but retains an 8‑bit “high‑precision lane” for attention maps that are highly sensitive to quantization noise.
    • Textual branch: Kept at 8‑bit because language embeddings are less tolerant of aggressive quantization.
    • The SHQ scheme chooses which layers receive the lower bit‑width based on a per‑layer sensitivity analysis performed offline.
  2. Gradient‑free Test‑time Adaptation

    • Instead of back‑propagating through the whole network, LQA introduces a tiny set of adapter modules (≈0.1 % of total parameters) placed after the multimodal fusion layer.
    • During inference on a new batch, the adapters are updated using a closed‑form solution derived from a regularized least‑squares objective that aligns model predictions with a self‑supervised consistency loss (e.g., augmentations of the same image‑text pair should produce similar embeddings).
    • Because the update is analytic, it requires only matrix multiplications—no gradient accumulation, no optimizer state, and negligible memory overhead.
  3. Deployment Pipeline

    • The quantized VLM is first compiled for the target edge accelerator (e.g., ARM Cortex‑A78, NPU).
    • At runtime, each incoming sample triggers the lightweight adapter update; the rest of the model runs entirely in quantized integer arithmetic, preserving speed and power efficiency.
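The offline sensitivity pass in step 1 can be illustrated with a minimal NumPy sketch. This is not the authors' code; `fake_quantize`, `assign_bitwidths`, and the calibration callback are hypothetical names, and the quantizer shown is plain symmetric uniform quantization standing in for whatever scheme SHQ actually uses:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric uniform quantization: snap weights to a (2**bits - 1)-level grid."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def assign_bitwidths(layers, calib_fn, low_bits=4, safe_bits=8, tol=0.01):
    """Offline sensitivity pass: quantize one layer at a time to the low
    bit-width and keep it there only if the calibration score drops by < tol;
    otherwise fall back to the safe (higher) bit-width."""
    base = calib_fn(dict(layers))
    plan = {}
    for name, w in layers.items():
        trial = dict(layers)
        trial[name] = fake_quantize(w, low_bits)
        plan[name] = low_bits if base - calib_fn(trial) < tol else safe_bits
    return plan
```

Layers whose weights survive 4‑bit rounding with negligible calibration loss keep the low bit‑width; sensitive layers (like the attention maps mentioned above) stay in the 8‑bit lane.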
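The gradient‑free update in step 2 amounts to solving a regularized least‑squares problem in closed form. A minimal sketch, assuming a linear adapter applied to fused features and consistency targets built by averaging embeddings across augmented views (function names are illustrative, not from the paper):

```python
import numpy as np

def consistency_targets(view_embeddings):
    """Self-supervised targets: the mean embedding over augmented views.
    view_embeddings has shape (n_views, batch, dim) -> returns (batch, dim)."""
    return view_embeddings.mean(axis=0)

def closed_form_adapter_update(feats, targets, lam=1.0):
    """Solve min_W ||F W - T||^2 + lam ||W||^2 analytically via
    W = (F^T F + lam I)^{-1} F^T T -- matrix products and one linear solve,
    so no gradient accumulation and no optimizer state."""
    d = feats.shape[1]
    gram = feats.T @ feats + lam * np.eye(d)
    return np.linalg.solve(gram, feats.T @ targets)
```

Because the update is a single regularized solve over a tiny adapter, its cost is dominated by a `dim × dim` factorization rather than a backward pass through the whole network, which is what keeps the per‑batch overhead in the low milliseconds.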

Results & Findings

| Metric / Dataset | Baseline FP VLM (no TTA) | Gradient‑based TTA | LQA (SHQ + gradient‑free TTA) |
|---|---|---|---|
| ImageNet‑C (synthetic corruptions) | 68.2 % | 71.1 % | 75.7 % |
| Night‑time Driving (real‑world) | 61.4 % | 63.0 % | 66.8 % |
| Medical X‑ray Captioning | 55.0 % | 56.2 % | 59.1 % |
| Memory usage (MB) | 1,200 | 1,200 (full precision) | ≈ 60 (≈19.9× reduction) |
| Adaptation latency per batch (ms) | 12 | 45 | 14 |
  • Accuracy boost: Across all seven benchmarks, LQA consistently outperforms both the non‑adapted model and the strongest gradient‑based TTA baselines, with an average improvement of 4.5 % in top‑1 accuracy.
  • Memory & latency: The hybrid quantization slashes model size to under 100 MB, and the gradient‑free update adds only a few milliseconds of overhead, keeping real‑time performance intact.
  • Privacy‑preserving: Since adaptation happens entirely on‑device with no gradient exchange, user data never leaves the edge, aligning with GDPR‑style constraints.

Practical Implications

  • Edge AI products: Developers can now embed powerful VLM capabilities (e.g., image captioning, visual question answering) into smartphones, AR glasses, or industrial cameras without needing a cloud fallback.
  • Reduced OTA updates: The model can self‑adjust to new lighting conditions, sensor drift, or domain changes on the fly, decreasing the frequency of costly firmware releases.
  • Energy efficiency: Quantized inference combined with a near‑zero‑cost adaptation loop translates to lower battery consumption—critical for wearables and drones.
  • Privacy‑first services: Applications such as on‑device medical image analysis or personal photo organization can adapt to user‑specific data while keeping everything local, satisfying strict privacy regulations.
  • Simplified dev‑ops: Because LQA works with existing open‑source VLMs (e.g., CLIP, BLIP) via a plug‑and‑play adapter, teams can retrofit their pipelines without retraining large models from scratch.

Limitations & Future Work

  • Sensitivity to quantization hyper‑parameters: The SHQ scheme requires an offline analysis to decide per‑layer bit‑widths; mis‑configuration could degrade performance on unseen hardware.
  • Adapter capacity: The current adapters are deliberately tiny; while sufficient for the evaluated shifts, more extreme domain gaps (e.g., medical modalities vastly different from natural images) may need larger adaptation blocks.
  • Hardware compatibility: The paper targets general ARM‑based NPUs; performance on highly specialized accelerators (e.g., Qualcomm Hexagon, Apple Neural Engine) still needs validation.
  • Future directions: The authors suggest exploring auto‑tuned quantization that runs on the device, extending the gradient‑free adaptation to multimodal generation tasks, and integrating continual learning safeguards to avoid catastrophic forgetting over long‑term deployment.

Authors

  • Xin Wang
  • Hualin Zhou
  • Sheng Guang Wang
  • Ting Dang
  • Yu Zhang
  • Hong Jia
  • Tao Gu

Paper Information

  • arXiv ID: 2602.07849v1
  • Categories: cs.AI
  • Published: February 8, 2026
  • PDF: Download PDF