[Paper] LiteEmbed: Adapting CLIP to Rare Classes

Published: January 14, 2026 at 12:53 PM EST
3 min read
Source: arXiv - 2601.09661v1

Overview

CLIP‑based vision‑language models are great at zero‑shot image classification, but they stumble when asked to recognize classes that were barely seen during pre‑training—think niche product lines, emerging memes, or culturally specific objects. LiteEmbed proposes a lightweight, plug‑and‑play way to personalize CLIP for such “rare” classes without touching the massive image or text encoders.

Key Contributions

  • Subspace‑guided embedding optimization: Uses a PCA‑derived decomposition of CLIP’s text space to separate coarse semantic directions from fine‑grained variations.
  • Dual‑objective training (one plausible formulation is sketched just after this list):
    • Coarse alignment keeps new embeddings anchored to the global semantic structure of CLIP.
    • Fine separation pushes rare‑class embeddings apart from visually similar neighbors, boosting discriminability.
  • Zero‑retraining deployment: Optimized embeddings can replace CLIP’s original text vectors directly in downstream tasks (classification, retrieval, segmentation, detection).
  • Broad empirical validation: Shows consistent improvements across multiple benchmarks and tasks, outperforming prior few‑shot personalization methods.
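
For concreteness, here is one plausible way to write the dual objective, consistent with the summary above. The notation is introduced here rather than taken from the paper: P_c and P_f are the coarse- and fine-subspace projectors, e_seed is the seed embedding, v_i⁺ are the few‑shot image embeddings of the rare class, t_j are text embeddings of visually similar base classes, τ is a temperature, and λ is a weighting term.

```latex
% One plausible form of the dual objective (notation assumed, not taken from the paper)
\mathcal{L}(e) =
  \underbrace{\lVert P_c e - P_c e_{\mathrm{seed}} \rVert_2^2}_{\text{coarse alignment}}
  + \lambda \underbrace{\left( -\log
      \frac{\sum_i \exp\!\left( \langle e, v_i^{+} \rangle / \tau \right)}
           {\sum_i \exp\!\left( \langle e, v_i^{+} \rangle / \tau \right)
            + \sum_j \exp\!\left( \langle P_f e, t_j \rangle / \tau \right)}
    \right)}_{\text{fine separation}}
```

Minimizing the first term keeps the optimized embedding's coarse‑subspace component anchored to the seed; minimizing the second pulls it toward the rare class's few‑shot image embeddings while pushing its fine‑subspace component away from the text embeddings of visually similar base classes.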

Methodology

  1. PCA of CLIP’s text space – The authors run Principal Component Analysis on the pre‑computed CLIP text embeddings of its whole vocabulary. The top components capture broad semantic axes (e.g., “animal vs. vehicle”), while the residual subspace encodes finer nuances.
  2. Embedding initialization – For each new rare class, a seed embedding is generated (e.g., by prompting CLIP with the class name).
  3. Optimization loop – The seed is projected onto the coarse and fine subspaces, and two loss terms are applied (see the code sketch after this list):
    • Coarse alignment loss penalizes drift away from the original coarse direction, preserving overall semantic consistency.
    • Fine separation loss (a contrastive term) pushes the fine‑subspace component away from embeddings of visually similar base classes, using a few labeled images per new class.
  4. Plug‑and‑play deployment – The resulting optimized text vectors replace the original CLIP text vectors in any downstream pipeline; no encoder weights are touched, so inference speed and memory stay unchanged.
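
The PyTorch sketch below illustrates steps 1–3 end to end under stated assumptions: random tensors stand in for pre‑computed CLIP text and image embeddings, and the subspace rank K, temperature tau, loss weight lam, and the exact loss shapes are illustrative guesses consistent with the description above, not the authors' implementation.

```python
# Hedged sketch of a LiteEmbed-style optimization loop (assumed details, not the authors' code).
# Random tensors stand in for pre-computed CLIP embeddings; K, tau, lam, and the loss shapes
# are illustrative choices.
import torch
import torch.nn.functional as F

D, V, K = 512, 1000, 64      # embedding dim, vocabulary size, coarse-subspace rank (assumed)
torch.manual_seed(0)

vocab_text = F.normalize(torch.randn(V, D), dim=-1)   # CLIP text embeddings of the vocabulary
seed = F.normalize(torch.randn(D), dim=-1)            # seed embedding for the rare class (step 2)
rare_images = F.normalize(torch.randn(5, D), dim=-1)  # few-shot image embeddings (1-5 shots)
neighbors = F.normalize(torch.randn(8, D), dim=-1)    # text embeddings of visually similar base classes

# Step 1: PCA of the text space. The top-K principal directions span the coarse subspace;
# the residual spans the fine subspace.
mean = vocab_text.mean(dim=0)
_, _, Vh = torch.linalg.svd(vocab_text - mean, full_matrices=False)
P_coarse = Vh[:K]                                     # (K, D) coarse basis

def proj_coarse(x):
    """Component of x lying in the coarse (top-K) subspace, re-centered on the mean."""
    return (x - mean) @ P_coarse.T @ P_coarse + mean

def proj_fine(x):
    """Residual (fine-subspace) component of x."""
    return x - proj_coarse(x)

# Step 3: optimize the rare-class embedding with the two loss terms.
emb = seed.clone().requires_grad_(True)
opt = torch.optim.Adam([emb], lr=1e-2)
tau, lam = 0.07, 0.5                                  # temperature and loss weight (assumed)

for step in range(200):
    opt.zero_grad()
    # Coarse alignment: stay close to the seed's coarse-subspace component.
    loss_coarse = (proj_coarse(emb) - proj_coarse(seed)).pow(2).sum()
    # Fine separation: contrastive term pulling toward the few-shot images of the rare class
    # and pushing the fine component away from similar base classes.
    pos = rare_images @ F.normalize(emb, dim=0) / tau
    neg = neighbors @ F.normalize(proj_fine(emb), dim=0) / tau
    loss_fine = torch.logsumexp(torch.cat([pos, neg]), dim=0) - torch.logsumexp(pos, dim=0)
    loss = loss_coarse + lam * loss_fine
    loss.backward()
    opt.step()

optimized = F.normalize(emb.detach(), dim=0)          # drop-in replacement for the class's text vector
```

In a real pipeline, vocab_text, seed, rare_images, and neighbors would come from a CLIP text/image encoder, and the resulting optimized vector would simply replace that class's row in the text‑embedding matrix.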

Results & Findings

| Task | Baseline (CLIP) | Prior few‑shot method | LiteEmbed | Gain over CLIP (points) |
| --- | --- | --- | --- | --- |
| Image classification (few‑shot on rare classes) | 62.4 % | 66.1 % | 71.8 % | +9.4 |
| Text‑to‑image retrieval (rare queries) | 48.7 % | 52.3 % | 58.9 % | +10.2 |
| Open‑set segmentation (novel object categories) | 41.2 % | 44.5 % | 50.3 % | +9.1 |
| Object detection (few‑shot novel classes) | 37.8 % | 40.2 % | 46.5 % | +8.7 |
  • Gains are especially pronounced when only 1–5 labeled images per new class are available.
  • The optimized embeddings retain CLIP’s zero‑shot performance on the original vocabulary, confirming that global semantics are not compromised.

Practical Implications

  • Rapid product‑specific classifiers: Companies can add a handful of images for a new SKU and instantly get a reliable classifier without re‑training massive models.
  • Culturally aware AI: Apps that need to understand region‑specific objects (e.g., local foods, traditional garments) can personalize CLIP on‑device with minimal compute.
  • Cost‑effective personalization: Since only the text embeddings are tuned, the approach fits into existing CLIP pipelines (e.g., OpenAI’s CLIP API, Hugging Face clip-vit-base-patch32) without extra GPU memory or latency; a minimal deployment sketch follows this list.
  • Cross‑task reuse: The same optimized embeddings improve not just classification but also retrieval, segmentation, and detection, reducing the need for task‑specific fine‑tuning.
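
As a rough illustration of the plug‑and‑play deployment, the snippet below drops an optimized rare‑class vector into a standard Hugging Face CLIP zero‑shot classification loop. The prompts, the placeholder image, and the random stand‑in for the optimized vector are illustrative assumptions, not part of the paper.

```python
# Minimal deployment sketch (assumed setup, not from the paper): drop an optimized
# rare-class vector into a standard Hugging Face CLIP zero-shot classification loop.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text features for the existing (base) vocabulary stay exactly as CLIP produces them.
base_prompts = ["a photo of a dog", "a photo of a cat"]
text_inputs = processor(text=base_prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)

# `optimized` would be the vector produced by the optimization sketch above;
# a random placeholder keeps this example self-contained.
optimized = torch.randn(text_feats.shape[-1])
text_feats = F.normalize(torch.cat([text_feats, optimized.unsqueeze(0)]), dim=-1)

# Score an image against the augmented class set; the rare class is the last row.
image = Image.new("RGB", (224, 224))  # placeholder for a real PIL image
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    img_feat = F.normalize(model.get_image_features(**image_inputs), dim=-1)

probs = (100.0 * img_feat @ text_feats.T).softmax(dim=-1)
print(probs)  # shape (1, 3): dog, cat, rare class
```

Because no encoder weights change, the same swapped‑in vectors can be reused across classification, retrieval, segmentation, and detection heads that consume CLIP text embeddings.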

Limitations & Future Work

  • Dependence on PCA quality: The subspace decomposition assumes a linear structure; highly non‑linear semantic relationships may not be captured.
  • Few‑shot label requirement: While only a handful of images are needed, completely label‑free adaptation (pure zero‑shot) is still out of scope.
  • Scalability to thousands of new classes: Optimizing embeddings one‑by‑one could become a bottleneck; the authors suggest exploring batch or meta‑learning strategies.
  • Broader modality tests: Future work could extend LiteEmbed to video‑language models or multimodal transformers beyond CLIP.

Authors

  • Aishwarya Agarwal
  • Srikrishna Karanam
  • Vineet Gandhi

Paper Information

  • arXiv ID: 2601.09661v1
  • Categories: cs.CV
  • Published: January 14, 2026