AVX2 SIMD Optimization for 12-bit JPEG Decoding in libjpeg-turbo — Pair Programming with Copilot CLI

Published: February 10, 2026 at 03:06 AM EST
4 min read
Source: Dev.to

Overview

I added AVX2 SIMD optimizations to libjpeg‑turbo’s 12‑bit JPEG decoding pipeline, achieving a 4.6 % speedup on Full HD and 2.5 % on 4K images.

  • libjpeg‑turbo is the world’s most widely used JPEG library, with highly optimized SIMD paths for 8‑bit JPEG.
  • 12‑bit JPEG (used in medical imaging and high‑precision workflows) previously had zero SIMD support—everything ran as scalar C code.

Using perf profiling, I identified three hotspots and implemented AVX2 intrinsics for each.

Hotspots and Implementation

| Target | Implementation | Impact |
| --- | --- | --- |
| IDCT (Inverse DCT) | 64‑bit arithmetic + AVX2 parallelization | ~3 % |
| YCC → RGB color conversion | SIMD compute + packed RGB interleave output | ~3 % |
| H2V2 fancy upsample | 16‑bit SIMD weighted interpolation | ~1.8 % |

The SIMD‑able portion of the pipeline (IDCT + color conversion + upsampling ≈ 44 % of CPU time) was optimized across all three targets. The remaining 37.6 % of CPU time is spent in Huffman decoding, which cannot be vectorized: each symbol's bit position in the stream depends on decoding the previous symbol, a sequential dependency inherent in the JPEG spec.
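To illustrate what the upsampling row in the table computes, here is a scalar sketch of the 3:1 "fancy" triangle filter, simplified to one dimension (the h2v2 path applies it vertically and horizontally). The function and variable names are mine, not libjpeg‑turbo's.

```c
/* Simplified scalar sketch of "fancy" 2x upsampling along one row.
 * Each input sample yields two outputs, each a 3:1 blend of the sample
 * and its left/right neighbor, with alternating rounding bias.
 * 12-bit samples are held in 16-bit integers. */
#include <stddef.h>
#include <stdint.h>

static void fancy_upsample_row(const uint16_t *in, uint16_t *out, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    uint16_t left  = in[i > 0     ? i - 1 : i];  /* edge: replicate sample */
    uint16_t right = in[i < n - 1 ? i + 1 : i];
    out[2 * i]     = (uint16_t)((3 * in[i] + left  + 1) >> 2);
    out[2 * i + 1] = (uint16_t)((3 * in[i] + right + 2) >> 2);
  }
}
```

The SIMD version performs the same weighted blend on 16 samples per iteration using 16‑bit lanes, which is why the table describes it as "16‑bit SIMD weighted interpolation".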

Benchmarks

Hardware: AMD Ryzen 9 9950X, GCC 13.3.0, -O3

| Resolution | Before | After | Improvement |
| --- | --- | --- | --- |
| Full HD (1920 × 1080) | 27.87 ms | 26.58 ms | 4.6 % |
| 4K (3840 × 2160) | 113.07 ms | 110.26 ms | 2.5 % |
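The improvement column is just (before - after) / before; a quick check of the table's figures:

```c
/* Verify the percentage improvements reported in the benchmark table. */
static double improvement_pct(double before_ms, double after_ms)
{
  return (before_ms - after_ms) / before_ms * 100.0;
}
```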

All 662 tests pass – JPEG compliance tests allow zero tolerance for bit‑level differences.

Repository Details

  • Repository:
  • Profile breakdown (representative run):
 37.63%  decode_mcu                    ← Huffman decoding (cannot SIMD)
 21.95%  jsimd_idct_islow_avx2_12bit   ← ✅ AVX2 optimized
 11.38%  ycc_rgb_convert               ← ✅ AVX2 optimized
 10.57%  h2v2_fancy_upsample           ← ✅ AVX2 optimized
  8.25%  put_rgb                       ← File I/O
  5.00%  jpeg_fill_bit_buffer          ← Bitstream parsing

Sample AVX2 Code (YCC → RGB conversion)

```c
// 12-bit samples are stored in 16-bit lanes: widen to 32-bit for the
// arithmetic, then pack back down to 16-bit at the end.
const __m256i zero = _mm256_setzero_si256();
__m256i y = _mm256_cvtepu16_epi32(_mm_loadu_si128((const __m128i *)inptr0));
// ... AVX2 YCC→RGB conversion ...
__m256i r16 = _mm256_packus_epi32(r, zero);  // 32-bit → 16-bit pack, unsigned saturation
```
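For reference, a scalar fixed-point version of the per-pixel math the AVX2 code vectorizes. The constants follow the classic libjpeg scheme (16 fractional bits); for 12-bit data the chroma center is 2048 and samples clamp to [0, 4095]. This is a sketch with illustrative names, not the library's actual implementation.

```c
/* Scalar fixed-point YCC -> RGB for 12-bit samples. */
#include <stdint.h>

#define SCALEBITS 16
#define FIX(x)    ((int32_t)((x) * (1L << SCALEBITS) + 0.5))
#define CENTER    2048        /* chroma center for 12-bit */
#define MAXSAMPLE 4095        /* 2^12 - 1 */

static int32_t clamp12(int32_t v)
{
  return v < 0 ? 0 : (v > MAXSAMPLE ? MAXSAMPLE : v);
}

static void ycc_to_rgb12(int32_t y, int32_t cb, int32_t cr,
                         int32_t *r, int32_t *g, int32_t *b)
{
  int32_t cbc = cb - CENTER, crc = cr - CENTER;
  int32_t half = 1 << (SCALEBITS - 1);  /* rounding bias */
  *r = clamp12(y + ((FIX(1.40200) * crc + half) >> SCALEBITS));
  *g = clamp12(y - ((FIX(0.34414) * cbc + FIX(0.71414) * crc + half) >> SCALEBITS));
  *b = clamp12(y + ((FIX(1.77200) * cbc + half) >> SCALEBITS));
}
```

Each product here stays within 32 bits (coefficients are below 2^17 and centered chroma is at most 2048 in magnitude), which is why the color-conversion path, unlike the IDCT, did not need 64-bit intermediates.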

Development Process with Copilot CLI

  • The entire project was built exclusively through GitHub Copilot CLI in the terminal—no IDE was used.
  • Copilot CLI handled:
    • perf record / perf report execution and analysis
    • AVX2 intrinsics code generation
    • Running all 662 ctest tests and benchmarking

Highlights

  1. Profiling‑driven prioritization – After each perf run, Copilot suggested the next function to optimize based on CPU‑time share, keeping the work focused on high‑impact targets.
  2. AVX2 intrinsics generation – Complex instructions such as _mm256_packus_epi32, _mm256_permute4x64_epi64, and _mm256_cvtepu16_epi32 were generated correctly without manual reference to Intel manuals.
  3. Debugging bit‑level failures – The 12‑bit IDCT initially suffered 1‑bit rounding errors. Copilot helped diagnose an overflow issue and switch from 32‑bit to 64‑bit intermediate arithmetic, fixing compliance failures.
  4. A/B testing infrastructure – Copilot introduced the JPEG12_IDCT_FORCE_C environment variable to toggle between SIMD and scalar paths, enabling clean before/after benchmarking.
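The A/B toggle in point 4 can be sketched as an environment-variable check at dispatch time. The function names below are hypothetical stand-ins for the library's method slots; only the `JPEG12_IDCT_FORCE_C` variable name comes from the project.

```c
/* Pick the scalar IDCT when JPEG12_IDCT_FORCE_C is set (or AVX2 is
 * unavailable); otherwise use the SIMD path. */
#include <stdlib.h>

typedef void (*idct_fn)(void);

static void idct_islow_c(void)    { /* scalar reference path */ }
static void idct_islow_avx2(void) { /* AVX2 path */ }

static idct_fn select_idct(int have_avx2)
{
  if (getenv("JPEG12_IDCT_FORCE_C") != NULL || !have_avx2)
    return idct_islow_c;        /* forced scalar, or no AVX2 */
  return idct_islow_avx2;       /* default: SIMD */
}
```

Because both paths must produce bit-identical output to pass the compliance tests, this toggle doubles as a correctness oracle: run the same image through both paths and diff the results.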

Why the CLI Was Perfect for This Work

  • SIMD optimization follows a tight write → build → test → profile → analyze → rewrite loop. Copilot CLI keeps this cycle entirely within the terminal, eliminating context switches to an editor.
  • libjpeg‑turbo’s 12‑bit types (J12SAMPLE, J12SAMPROW, J12SAMPARRAY) are rarely seen in training data. Copilot initially generated dispatch logic using the compile‑time BITS_IN_JSAMPLE macro, but the correct approach required runtime data_precision checks because libjpeg‑turbo builds a single binary supporting multiple precisions.
  • Measuring small gains (2‑3 %) on an already world‑class baseline demands careful benchmark design with multiple runs and statistical analysis—tasks that Copilot automated within the terminal workflow.
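The runtime-precision point above can be sketched as follows: a single binary serves multiple precisions, so the dispatcher must branch on the decompress object's `data_precision` field at run time rather than on a compile-time macro like `BITS_IN_JSAMPLE`. The struct and enum below are simplified stand-ins, not libjpeg-turbo's actual API.

```c
/* Runtime dispatch on data precision, not compile-time macros. */
typedef struct { int data_precision; } decoder_state;

typedef enum { PATH_SCALAR, PATH_AVX2_8BIT, PATH_AVX2_12BIT } code_path;

static code_path select_path(const decoder_state *st, int have_avx2)
{
  if (!have_avx2)
    return PATH_SCALAR;
  switch (st->data_precision) {
  case 8:  return PATH_AVX2_8BIT;
  case 12: return PATH_AVX2_12BIT;  /* the new path added in this work */
  default: return PATH_SCALAR;      /* e.g. 16-bit: no SIMD path */
  }
}
```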

Conclusion

By targeting the three most time‑consuming SIMD‑able stages of the 12‑bit JPEG decoding pipeline, the AVX2 optimizations deliver measurable performance improvements without sacrificing JPEG compliance. The project demonstrates how GitHub Copilot CLI can drive low‑level SIMD development end‑to‑end, from profiling to production‑ready code, entirely from the command line.
