[Paper] What Matters in Practical Learned Image Compression

Published: May 6, 2026 at 01:17 PM EDT

Source: arXiv - 2605.05148v1

Overview

This paper tackles a long‑standing gap in learned image compression: building a codec that is both perceptually optimal and fast enough for real‑world devices. By systematically exploring architecture choices, training tricks, and a performance‑aware neural architecture search, the authors deliver a neural compressor that outperforms existing codec standards (AV1, VVC, JPEG‑AI) and prior learned methods by a wide margin while encoding and decoding 12 MP images on a consumer smartphone in roughly 150–230 ms.

Key Contributions

Comprehensive ablation study of the design knobs that affect perceptual quality, bitrate, and runtime in learned codecs.
Introduction of novel training and model‑level techniques (e.g., perceptual loss weighting, entropy model refinements, lightweight attention modules) that improve the speed‑quality trade‑off.
Performance‑aware Neural Architecture Search (NAS) over millions of backbone configurations, explicitly constrained by on‑device latency targets.
Construction of a practical end‑to‑end codec that achieves 2.3–3× bitrate savings versus AV1/AV2/VVC/ECM/JPEG‑AI and 20–40% savings over the strongest learned baselines.
Real‑time on‑device benchmarks: 12 MP image encoding in ~230 ms and decoding in ~150 ms on an iPhone 17 Pro Max, beating many GPU‑based ML codecs.
Rigorous subjective user studies confirming that the perceptual gains translate to human‑perceived quality improvements.

Methodology

Baseline Architecture – The authors start from a modern auto‑encoder with a hyper‑prior entropy model, a common backbone for learned compression.
Design Space Exploration – They isolate key components (e.g., convolutional block type, channel width, attention placement, entropy model granularity) and evaluate each for three axes:
- Perceptual quality (measured by LPIPS, MS‑SSIM, and human MOS).
- Bitrate efficiency (bits per pixel).
- Runtime (CPU/GPU/phone inference time).
Novel Optimizations –
- Perceptual‑aware loss scheduling that gradually shifts emphasis from distortion to perceptual metrics during training.
- Grouped entropy coding to reduce the overhead of context modeling without sacrificing compression.
- Lightweight attention blocks (e.g., squeeze‑excitation) that add expressive power with minimal FLOPs.
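The perceptual‑aware loss scheduling listed above can be illustrated with a minimal sketch. The linear ramp, the warm‑up fraction, the weighting factor `lam`, and the function names `perceptual_weight` and `scheduled_loss` are all assumptions made for illustration; this summary does not state the schedule or loss terms the authors actually use.

```python
def perceptual_weight(step: int, total_steps: int, warmup_frac: float = 0.2) -> float:
    """Weight on the perceptual term: 0 during warm-up, then a linear ramp to 1.

    The ramp shape and warmup_frac are hypothetical choices, not the paper's.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must be in [0, 1)")
    warmup_steps = int(warmup_frac * total_steps)
    if step <= warmup_steps:
        return 0.0
    ramp_steps = total_steps - warmup_steps
    return min(1.0, (step - warmup_steps) / ramp_steps)


def scheduled_loss(rate_bpp, distortion, perceptual, step, total_steps, lam=0.01):
    """Rate term plus a blend that shifts from distortion to perceptual loss.

    lam is an assumed rate-vs-quality trade-off factor.
    """
    w = perceptual_weight(step, total_steps)
    return rate_bpp + lam * ((1.0 - w) * distortion + w * perceptual)


# Early in training the distortion term dominates; late in training the perceptual term does.
for step in (0, 60, 100):
    print(step, perceptual_weight(step, total_steps=100))
```

In this sketch the training loop would call `scheduled_loss` once per step with whatever distortion metric (e.g., MSE) and perceptual metric (e.g., LPIPS) are being optimized.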
Performance‑Aware NAS – Using a multi‑objective evolutionary algorithm, they search the combinatorial space of backbone configurations while enforcing a hard latency constraint measured on the target device. The fitness function balances bitrate‑per‑perceptual‑score against the latency budget.
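A toy version of a latency‑constrained evolutionary search might look like the following. It is a simplified single‑objective variant, not the multi‑objective algorithm from the paper: candidates that exceed the latency budget are discarded outright, and the survivors are ranked by a single cost. The search space, `estimate_latency_ms`, and `proxy_cost` are hypothetical stand‑ins; a real search would measure latency on the target device and train or estimate each candidate's rate and perceptual score.

```python
import random

# Hypothetical search space: (channel width, block count) for each of three stages.
WIDTHS = [32, 64, 96, 128, 192]
DEPTHS = [1, 2, 3, 4]
NUM_STAGES = 3


def estimate_latency_ms(config):
    """Stand-in latency model; a real search would time each candidate on the device."""
    return sum(0.05 * width * depth for width, depth in config)


def proxy_cost(config):
    """Stand-in for bitrate per unit of perceptual score (lower is better)."""
    return sum(1.0 / (width * depth) ** 0.5 for width, depth in config)


def random_config(rng):
    return tuple((rng.choice(WIDTHS), rng.choice(DEPTHS)) for _ in range(NUM_STAGES))


def mutate(config, rng):
    """Resample one randomly chosen stage."""
    stages = list(config)
    stages[rng.randrange(NUM_STAGES)] = (rng.choice(WIDTHS), rng.choice(DEPTHS))
    return tuple(stages)


def search(budget_ms, generations=30, population_size=20, seed=0):
    smallest = tuple((min(WIDTHS), min(DEPTHS)) for _ in range(NUM_STAGES))
    if estimate_latency_ms(smallest) > budget_ms:
        raise ValueError("no configuration fits the latency budget")
    rng = random.Random(seed)
    # Seed the population with feasible candidates only (hard constraint).
    population = [smallest]
    while len(population) < population_size:
        candidate = random_config(rng)
        if estimate_latency_ms(candidate) <= budget_ms:
            population.append(candidate)
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(population_size)]
        feasible = [c for c in children if estimate_latency_ms(c) <= budget_ms]
        # Keep the lowest-cost feasible candidates (elitist selection).
        population = sorted(set(population + feasible), key=proxy_cost)[:population_size]
    return population[0]


best = search(budget_ms=40.0)
print(best, round(estimate_latency_ms(best), 1), round(proxy_cost(best), 3))
```

Treating latency as a feasibility filter rather than a penalty term keeps every surviving candidate deployable, which matches the "hard latency constraint" framing in the summary.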
End‑to‑End System Integration – The selected architecture is quantized to 8‑bit, compiled with Apple’s Core ML, and paired with a fast entropy coder to meet the on‑device speed targets.
Evaluation – Objective metrics are complemented by large‑scale double‑blind user studies to validate perceptual superiority.

Results & Findings

Metric	Proposed Codec	Best Traditional (VVC)	Best Prior Learned
Bitrate (bpp) @ comparable MOS	0.45 bpp	1.0 bpp (≈2.2× higher)	0.58 bpp (≈1.3× higher)
LPIPS (lower is better)	0.12	0.22	0.16
Encoding latency (12 MP)	230 ms (iPhone 17 Pro Max)	N/A (desktop)	340 ms (GPU)
Decoding latency (12 MP)	150 ms (iPhone)	N/A	210 ms (GPU)
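To make the table's figures concrete, the short calculation below converts the bits‑per‑pixel rates into approximate file sizes and recomputes the ratios and latency reductions. The bpp and latency values are taken from the table; applying those rates to a 12 MP image is an assumption made for illustration, since the table reports rates at comparable MOS rather than for a specific resolution.

```python
PIXELS_12MP = 12 * 1_000_000


def file_size_kb(bpp, pixels=PIXELS_12MP):
    """Compressed size in kilobytes (1 kB = 1000 bytes) at a given bits-per-pixel rate."""
    return bpp * pixels / 8 / 1000


def latency_reduction(baseline_ms, proposed_ms):
    """Fractional reduction in latency relative to the baseline."""
    return (baseline_ms - proposed_ms) / baseline_ms


rates_bpp = {"proposed": 0.45, "vvc": 1.0, "prior_learned": 0.58}
for name, bpp in rates_bpp.items():
    ratio = bpp / rates_bpp["proposed"]
    print(f"{name}: {file_size_kb(bpp):.0f} kB, {ratio:.2f}x the proposed bitrate")

print(f"encode: {latency_reduction(340, 230):.0%} lower latency than the prior learned codec")
print(f"decode: {latency_reduction(210, 150):.0%} lower latency than the prior learned codec")
```

Under these assumptions a 12 MP image comes to roughly 675 kB with the proposed codec versus about 1500 kB with VVC and 870 kB with the best prior learned codec.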

Subjective MOS: Users consistently rated the new codec higher than all baselines, confirming that the objective gains are perceptually meaningful.
Speed: The on‑device encoder/decoder is ~30% faster than the previous state‑of‑the‑art learned codec running on a high‑end NVIDIA V100, demonstrating that careful architecture‑runtime co‑design can beat heavyweight GPU solutions.
Ablation Insights: Perceptual loss scheduling contributed ~0.05 bpp savings; lightweight attention saved a further ~0.03 bpp without noticeable latency increase; entropy model tweaks shaved ~10% off runtime.

Practical Implications

Mobile Photo Apps – Developers can integrate a plug‑and‑play compression module that reduces upload bandwidth by up to 3× while keeping visual quality high, directly benefiting user experience and data costs.
Edge‑AI Pipelines – Real‑time image streaming from drones, AR glasses, or IoT cameras can now rely on on‑device neural compression without offloading to the cloud, saving latency and preserving privacy.
Content Delivery Networks – The codec’s bitrate efficiency can lower storage and CDN egress costs; its fast decode path makes it suitable for browsers or native viewers that need instant image rendering.
Standardization & Interoperability – Although not a formal standard, the open‑source implementation (if released) could serve as a reference for future perceptual‑oriented image coding standards, influencing JPEG‑AI or next‑gen codecs.
Developer Tooling – The performance‑aware NAS pipeline showcased in the paper can be repurposed for other on‑device ML tasks where latency is a hard constraint (e.g., super‑resolution, denoising).

Limitations & Future Work

Hardware Specificity – The latency budget and NAS search were tuned for Apple silicon; performance on Android or embedded CPUs may differ and would require a separate search.
Training Cost – The multi‑objective NAS over millions of configurations is computationally intensive, which may be prohibitive for smaller research teams.
Generalization to Video – The study focuses on still images; extending the perceptual‑runtime co‑design to video codecs (temporal entropy, motion) remains an open challenge.
Robustness to Diverse Content – While the user study covered a broad set of images, edge cases (e.g., medical imaging, satellite data) may need domain‑specific fine‑tuning.

Future directions include cross‑platform NAS, adaptive bitrate control based on device load, and joint optimization with downstream vision models (e.g., object detection on compressed inputs).

Authors

Kedar Tatwawadi
Parisa Rahimzadeh
Zhanghao Sun
Zhiqi Chen
Ziyun Yang
Sanjay Nair
Divija Hasteer
Oren Rippel

Paper Information

arXiv ID: 2605.05148v1
Categories: cs.CV, cs.AI, cs.LG
Published: May 6, 2026
