[Paper] VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Source: arXiv - 2512.15701v1
Overview
The paper introduces VLIC, a novel image‑compression pipeline that uses modern vision‑language models (VLMs) as perceptual judges to align compressed images with human visual preferences. By tapping into the zero‑shot reasoning abilities of VLMs, the authors achieve compression quality that rivals or surpasses state‑of‑the‑art methods—without hand‑crafting a separate perceptual loss network.
Key Contributions
- Zero‑shot perceptual judging: Demonstrates that off‑the‑shelf VLMs (e.g., CLIP, BLIP) can accurately predict human two‑alternative forced‑choice (2AFC) judgments on image pairs (a minimal judge of this kind is sketched after this list).
- VLIC architecture: Builds a diffusion‑based compressor that is post‑trained directly on binary VLM judgments, eliminating the need for a dedicated perceptual loss model.
- Competitive performance: Achieves state‑of‑the‑art human‑aligned compression scores on several benchmark datasets, validated by both automated perceptual metrics (LPIPS, DISTS) and large‑scale user studies.
- Reward‑design analysis: Provides an extensive ablation of how different VLM‑derived reward signals (e.g., raw logits, softmax probabilities, contrastive similarity) affect training stability and final quality.
- Open resources: Releases code, pretrained checkpoints, and a visual demo site for reproducibility and community experimentation.
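As a concrete illustration of the zero‑shot judging contribution, the sketch below answers a 2AFC query ("which candidate looks more like the original?") with an off‑the‑shelf CLIP model by comparing image embeddings. This is a minimal sketch, not the paper's actual judge: the `openai/clip-vit-base-patch32` checkpoint, the Hugging Face `transformers` API, and the cosine‑similarity decision rule are assumptions made for illustration.

```python
# Minimal sketch of a zero-shot 2AFC judge built on CLIP image embeddings.
# Assumptions (not from the paper): Hugging Face `transformers` CLIP API,
# the `openai/clip-vit-base-patch32` checkpoint, a cosine-similarity rule.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)


@torch.no_grad()
def embed(images):
    """Return L2-normalized CLIP embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


@torch.no_grad()
def judge_2afc(original: Image.Image, cand_a: Image.Image, cand_b: Image.Image) -> int:
    """Answer 'Which image looks more like the original?': 0 for A, 1 for B."""
    ref, a, b = embed([original, cand_a, cand_b])
    sim_a = (ref * a).sum().item()  # cosine similarity of candidate A to the original
    sim_b = (ref * b).sum().item()  # cosine similarity of candidate B to the original
    return 0 if sim_a >= sim_b else 1
```

A generative VLM would instead be asked the question in natural language; either way, the judge's output is a single binary preference per image pair.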
Methodology
- Baseline compressor: The authors start from a diffusion‑based image compression model that learns to reconstruct images from a compact latent representation.
- Preference data generation: For each training image, two compressed variants are produced (e.g., at different bitrates or with different random seeds). The VLM is prompted with a natural‑language query such as "Which image looks more like the original?" and returns a binary preference derived from its similarity scores (as in the judge sketch above).
- Reward formulation: The VLM's output is turned into a scalar reward (higher for the image it prefers). Several reward functions are explored (sketched after this list), including:
  - Logit difference between the two candidates.
  - Softmax‑scaled similarity to the reference image.
- Post‑training with RL‑style loss: The diffusion compressor is fine‑tuned using a simple preference‑based loss (e.g., REINFORCE or a differentiable surrogate) that pushes the model toward generating the variant the VLM prefers (a schematic update step is sketched after this list). No extra perceptual network is trained; the VLM itself serves as the "critic."
- Evaluation pipeline: After fine‑tuning, the model is tested on standard compression benchmarks. Human alignment is measured via:
  - Objective perceptual metrics (LPIPS, DISTS).
  - Large‑scale user studies where participants perform 2AFC comparisons between VLIC outputs and those of competing methods.
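The reward variants listed under "Reward formulation" can be written down compactly. Below is a minimal sketch assuming the judge exposes scalar similarity (or logit) scores `s_a` and `s_b` for the two candidates against the reference; the temperature value is an illustrative assumption, not a number from the paper.

```python
import torch


def logit_difference_reward(s_a: torch.Tensor, s_b: torch.Tensor) -> torch.Tensor:
    """Reward for candidate A: raw score gap to candidate B (can be negative)."""
    return s_a - s_b


def softmax_reward(s_a: torch.Tensor, s_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Reward for candidate A: softmax probability that the judge prefers it."""
    logits = torch.stack([s_a, s_b]) / temperature  # temperature is illustrative
    return torch.softmax(logits, dim=0)[0]
```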
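The post‑training step itself can be sketched as a REINFORCE‑style update: sample two reconstructions, score them with the VLM judge, and raise the log‑likelihood of the preferred one. Everything below is schematic; `compressor.sample_with_logprob` and `judge_reward` are hypothetical interfaces standing in for the diffusion sampler and the VLM judge, not the released code.

```python
import torch


def preference_step(compressor, judge_reward, image, optimizer):
    """One REINFORCE-style update on a single training image (schematic).

    Hypothetical interfaces:
      - compressor.sample_with_logprob(image) -> (reconstruction, summed
        log-probability of the sampling trajectory)
      - judge_reward(image, recon_a, recon_b) -> scalar rewards (r_a, r_b),
        e.g. the logit-difference reward sketched above
    """
    recon_a, logp_a = compressor.sample_with_logprob(image)
    recon_b, logp_b = compressor.sample_with_logprob(image)

    with torch.no_grad():
        r_a, r_b = judge_reward(image, recon_a, recon_b)
        # Two-sample baseline: push toward the preferred sample, away from the other.
        adv_a, adv_b = r_a - r_b, r_b - r_a

    loss = -(adv_a * logp_a + adv_b * logp_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```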
Results & Findings
| Dataset | Bitrate (bpp) | LPIPS ↓ | DISTS ↓ | Human 2AFC win‑rate vs. best baseline |
|---|---|---|---|---|
| Kodak | 0.25 | 0.12 | 0.09 | 68% |
| DIV2K‑test | 0.15 | 0.15 | 0.11 | 71% |
| CLIC‑validation | 0.30 | 0.10 | 0.08 | 65% |
- VLIC consistently outperforms traditional codecs (JPEG, BPG) and recent learning‑based compressors that rely on MSE or handcrafted perceptual losses.
- The zero‑shot VLM judgments correlate strongly (≈0.78 Pearson) with actual human preferences, confirming that VLMs can act as reliable proxies for human perception.
- Ablation studies show that using logit‑difference rewards yields the most stable training, while raw similarity scores can cause mode collapse.
- Training time overhead is modest: post‑training adds ~15% extra compute compared to the base diffusion model, because VLM inference is batched and cached.
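The batching‑and‑caching point above can be approximated with a simple memo table keyed by a hash of the image triplet, so repeated queries never re‑invoke the VLM. This is an illustrative pattern, not the released implementation.

```python
import hashlib

import numpy as np

# In-memory memo table: pair-hash -> cached VLM judgment (preference or reward).
_judgment_cache = {}


def pair_key(original: np.ndarray, cand_a: np.ndarray, cand_b: np.ndarray) -> str:
    """Hash the raw pixels of the triplet so repeated queries skip VLM inference."""
    h = hashlib.sha256()
    for img in (original, cand_a, cand_b):
        h.update(np.ascontiguousarray(img).tobytes())
    return h.hexdigest()


def cached_judgment(original, cand_a, cand_b, judge_fn):
    """Return a previously computed VLM judgment, or compute and store a new one."""
    key = pair_key(original, cand_a, cand_b)
    if key not in _judgment_cache:
        _judgment_cache[key] = judge_fn(original, cand_a, cand_b)
    return _judgment_cache[key]
```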
Practical Implications
- Developer‑ready perceptual loss: Instead of training a separate CNN‑based perceptual network (e.g., VGG‑based LPIPS), engineers can plug a pre‑trained VLM into their compression pipeline and obtain a human‑aligned training signal out of the box (a drop‑in wrapper of this kind is sketched after this list).
- Zero‑shot adaptability: VLIC can be fine‑tuned for domain‑specific aesthetics (e.g., medical imaging, satellite photos) simply by prompting the VLM with appropriate language cues—no new labeled preference data required.
- Edge‑device compression: Because the VLM is only used during training, the inference‑time compressor remains lightweight (diffusion decoder + tiny latent encoder), making it suitable for on‑device or server‑side deployment where latency matters.
- Cross‑modal extensions: The same preference‑learning framework could be applied to video codecs, audio compression, or even generative model distillation, wherever human perceptual quality is the bottleneck.
- Open‑source toolkit: The released code includes scripts to generate VLM judgments, define reward functions, and integrate with popular diffusion libraries (e.g., Diffusers, Stable Diffusion), lowering the barrier for rapid prototyping.
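To make the "developer‑ready perceptual loss" and prompt‑based adaptation points concrete, the sketch below wraps a VLM judge behind a small callable that an existing training loop can treat like any other reward; swapping the `prompt` string is the only change needed to retarget the judge to a new domain. The `query_vlm` interface is hypothetical and stands in for whichever VLM backend is used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VLMPerceptualCritic:
    """Drop-in preference judge for an existing training loop (schematic).

    `query_vlm(original, cand_a, cand_b, prompt)` is a hypothetical backend call
    that returns 0 if the VLM prefers the first candidate and 1 otherwise.
    """
    query_vlm: Callable
    prompt: str = "Which image looks more like the original?"

    def __call__(self, original, candidate, baseline) -> float:
        preferred = self.query_vlm(original, candidate, baseline, self.prompt)
        return 1.0 if preferred == 0 else -1.0


# Retargeting the judge to a new domain is a prompt change only, e.g.:
# medical_critic = VLMPerceptualCritic(
#     query_vlm=my_vlm_backend,  # hypothetical backend function
#     prompt="Which reconstruction preserves diagnostically relevant detail?",
# )
```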
Limitations & Future Work
- VLM bias transfer: Since VLMs inherit biases from their training data, the compression preferences may reflect those biases (e.g., favoring certain object categories). Mitigating this requires careful prompt engineering or bias‑aware fine‑tuning.
- Scalability of VLM inference: While acceptable for research‑scale datasets, generating VLM judgments for massive corpora could become a bottleneck; future work could explore distilling the VLM’s preference function into a lightweight network.
- Resolution ceiling: The current diffusion backbone is limited to ≤512 px inputs; extending VLIC to ultra‑high‑resolution imagery will need hierarchical or patch‑based diffusion strategies.
- User study diversity: The reported human studies focus on a relatively homogeneous participant pool; broader demographic testing would strengthen claims about universal perceptual alignment.
- Alternative VLMs: The paper evaluates a handful of VLMs; systematic benchmarking across newer multimodal models (e.g., Flamingo, LLaVA) could uncover even stronger judges or reveal failure modes.
VLIC opens a promising path where large‑scale vision‑language models become the “eyes” of compression systems, turning language‑guided visual reasoning into tangible bandwidth savings.
Authors
- Kyle Sargent
- Ruiqi Gao
- Philipp Henzler
- Charles Herrmann
- Aleksander Holynski
- Li Fei-Fei
- Jiajun Wu
- Jason Zhang
Paper Information
- arXiv ID: 2512.15701v1
- Categories: cs.CV
- Published: December 17, 2025