AVX2 SIMD Optimization for 12-bit JPEG Decoding in libjpeg-turbo — Pair Programming with Copilot CLI

Published: February 10, 2026 at 03:06 AM EST
4 min read
Source: Dev.to

Overview

I added AVX2 SIMD optimizations to libjpeg‑turbo’s 12‑bit JPEG decoding pipeline, achieving a 4.6 % speedup on Full HD and 2.5 % on 4K images.

  • libjpeg‑turbo is the world’s most widely used JPEG library, with highly optimized SIMD paths for 8‑bit JPEG.
  • 12‑bit JPEG (used in medical imaging and high‑precision workflows) previously had zero SIMD support—everything ran as scalar C code.

Using perf profiling, I identified three hotspots and implemented AVX2 intrinsics for each.

Hotspots and Implementation

| Target | Implementation | Impact |
| --- | --- | --- |
| IDCT (Inverse DCT) | 64‑bit arithmetic + AVX2 parallelization | ~3 % |
| YCC → RGB color conversion | SIMD compute + packed RGB interleave output | ~3 % |
| H2V2 fancy upsample | 16‑bit SIMD weighted interpolation | ~1.8 % |

The SIMD‑able portion of the pipeline (IDCT + color conversion + upsampling ≈ 44 % of CPU time) was optimized across all three targets. The remaining 37.6 % of CPU time is spent in Huffman decoding, which cannot be vectorized: each symbol's bit position in the stream depends on decoding the previous symbol, a sequential dependency inherent in the JPEG spec.
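To illustrate what the upsampling row in the table computes, here is a scalar sketch of the 3:1 "fancy" triangle filter, simplified to one dimension (the h2v2 path applies it vertically and horizontally). The function and variable names are mine, not libjpeg‑turbo's.

```c
/* Simplified scalar sketch of "fancy" 2x upsampling along one row.
 * Each input sample yields two outputs, each a 3:1 blend of the sample
 * and its left/right neighbor, with alternating rounding bias.
 * 12-bit samples are held in 16-bit integers. */
#include <stddef.h>
#include <stdint.h>

static void fancy_upsample_row(const uint16_t *in, uint16_t *out, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    uint16_t left  = in[i > 0     ? i - 1 : i];  /* edge: replicate sample */
    uint16_t right = in[i < n - 1 ? i + 1 : i];
    out[2 * i]     = (uint16_t)((3 * in[i] + left  + 1) >> 2);
    out[2 * i + 1] = (uint16_t)((3 * in[i] + right + 2) >> 2);
  }
}
```

The SIMD version performs the same weighted blend on 16 samples per iteration using 16‑bit lanes, which is why the table describes it as "16‑bit SIMD weighted interpolation".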

Benchmarks

Hardware: AMD Ryzen 9 9950X, GCC 13.3.0, -O3

| Resolution | Before | After | Improvement |
| --- | --- | --- | --- |
| Full HD (1920 × 1080) | 27.87 ms | 26.58 ms | 4.6 % |
| 4K (3840 × 2160) | 113.07 ms | 110.26 ms | 2.5 % |
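The improvement column is just (before - after) / before; a quick check of the table's figures:

```c
/* Verify the percentage improvements reported in the benchmark table. */
static double improvement_pct(double before_ms, double after_ms)
{
  return (before_ms - after_ms) / before_ms * 100.0;
}
```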

All 662 tests pass – JPEG compliance tests allow zero tolerance for bit‑level differences.

Repository Details

  • Repository:
  • Profile breakdown (representative run):
 37.63%  decode_mcu                    ← Huffman decoding (cannot SIMD)
 21.95%  jsimd_idct_islow_avx2_12bit   ← ✅ AVX2 optimized
 11.38%  ycc_rgb_convert               ← ✅ AVX2 optimized
 10.57%  h2v2_fancy_upsample           ← ✅ AVX2 optimized
  8.25%  put_rgb                       ← File I/O
  5.00%  jpeg_fill_bit_buffer          ← Bitstream parsing

Sample AVX2 Code (YCC → RGB conversion)

```c
// 12-bit samples are stored in 16-bit lanes: widen to 32-bit for the
// arithmetic, then pack back down to 16-bit at the end.
const __m256i zero = _mm256_setzero_si256();
__m256i y = _mm256_cvtepu16_epi32(_mm_loadu_si128((const __m128i *)inptr0));
// ... AVX2 YCC→RGB conversion ...
__m256i r16 = _mm256_packus_epi32(r, zero);  // 32-bit → 16-bit pack, unsigned saturation
```
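For reference, a scalar fixed-point version of the per-pixel math the AVX2 code vectorizes. The constants follow the classic libjpeg scheme (16 fractional bits); for 12-bit data the chroma center is 2048 and samples clamp to [0, 4095]. This is a sketch with illustrative names, not the library's actual implementation.

```c
/* Scalar fixed-point YCC -> RGB for 12-bit samples. */
#include <stdint.h>

#define SCALEBITS 16
#define FIX(x)    ((int32_t)((x) * (1L << SCALEBITS) + 0.5))
#define CENTER    2048        /* chroma center for 12-bit */
#define MAXSAMPLE 4095        /* 2^12 - 1 */

static int32_t clamp12(int32_t v)
{
  return v < 0 ? 0 : (v > MAXSAMPLE ? MAXSAMPLE : v);
}

static void ycc_to_rgb12(int32_t y, int32_t cb, int32_t cr,
                         int32_t *r, int32_t *g, int32_t *b)
{
  int32_t cbc = cb - CENTER, crc = cr - CENTER;
  int32_t half = 1 << (SCALEBITS - 1);  /* rounding bias */
  *r = clamp12(y + ((FIX(1.40200) * crc + half) >> SCALEBITS));
  *g = clamp12(y - ((FIX(0.34414) * cbc + FIX(0.71414) * crc + half) >> SCALEBITS));
  *b = clamp12(y + ((FIX(1.77200) * cbc + half) >> SCALEBITS));
}
```

Each product here stays within 32 bits (coefficients are below 2^17 and centered chroma is at most 2048 in magnitude), which is why the color-conversion path, unlike the IDCT, did not need 64-bit intermediates.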

Development Process with Copilot CLI

  • The entire project was built exclusively through GitHub Copilot CLI in the terminal—no IDE was used.
  • Copilot CLI handled:
    • perf record / perf report execution and analysis
    • AVX2 intrinsics code generation
    • Running all 662 ctest tests and benchmarking

Highlights

  1. Profiling‑driven prioritization – After each perf run, Copilot suggested the next function to optimize based on CPU‑time share, keeping the work focused on high‑impact targets.
  2. AVX2 intrinsics generation – Complex instructions such as _mm256_packus_epi32, _mm256_permute4x64_epi64, and _mm256_cvtepu16_epi32 were generated correctly without manual reference to Intel manuals.
  3. Debugging bit‑level failures – The 12‑bit IDCT initially suffered 1‑bit rounding errors. Copilot helped diagnose an overflow issue and switch from 32‑bit to 64‑bit intermediate arithmetic, fixing compliance failures.
  4. A/B testing infrastructure – Copilot introduced the JPEG12_IDCT_FORCE_C environment variable to toggle between SIMD and scalar paths, enabling clean before/after benchmarking.
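The A/B toggle in point 4 can be sketched as an environment-variable check at dispatch time. The function names below are hypothetical stand-ins for the library's method slots; only the `JPEG12_IDCT_FORCE_C` variable name comes from the project.

```c
/* Pick the scalar IDCT when JPEG12_IDCT_FORCE_C is set (or AVX2 is
 * unavailable); otherwise use the SIMD path. */
#include <stdlib.h>

typedef void (*idct_fn)(void);

static void idct_islow_c(void)    { /* scalar reference path */ }
static void idct_islow_avx2(void) { /* AVX2 path */ }

static idct_fn select_idct(int have_avx2)
{
  if (getenv("JPEG12_IDCT_FORCE_C") != NULL || !have_avx2)
    return idct_islow_c;        /* forced scalar, or no AVX2 */
  return idct_islow_avx2;       /* default: SIMD */
}
```

Because both paths must produce bit-identical output to pass the compliance tests, this toggle doubles as a correctness oracle: run the same image through both paths and diff the results.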

Why the CLI Was Perfect for This Work

  • SIMD optimization follows a tight write → build → test → profile → analyze → rewrite loop. Copilot CLI keeps this cycle entirely within the terminal, eliminating context switches to an editor.
  • libjpeg‑turbo’s 12‑bit types (J12SAMPLE, J12SAMPROW, J12SAMPARRAY) are rarely seen in training data. Copilot initially generated dispatch logic using the compile‑time BITS_IN_JSAMPLE macro, but the correct approach required runtime data_precision checks because libjpeg‑turbo builds a single binary supporting multiple precisions.
  • Measuring small gains (2‑3 %) on an already world‑class baseline demands careful benchmark design with multiple runs and statistical analysis—tasks that Copilot automated within the terminal workflow.
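The runtime-precision point above can be sketched as follows: a single binary serves multiple precisions, so the dispatcher must branch on the decompress object's `data_precision` field at run time rather than on a compile-time macro like `BITS_IN_JSAMPLE`. The struct and enum below are simplified stand-ins, not libjpeg-turbo's actual API.

```c
/* Runtime dispatch on data precision, not compile-time macros. */
typedef struct { int data_precision; } decoder_state;

typedef enum { PATH_SCALAR, PATH_AVX2_8BIT, PATH_AVX2_12BIT } code_path;

static code_path select_path(const decoder_state *st, int have_avx2)
{
  if (!have_avx2)
    return PATH_SCALAR;
  switch (st->data_precision) {
  case 8:  return PATH_AVX2_8BIT;
  case 12: return PATH_AVX2_12BIT;  /* the new path added in this work */
  default: return PATH_SCALAR;      /* e.g. 16-bit: no SIMD path */
  }
}
```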

Conclusion

By targeting the three most time‑consuming SIMD‑able stages of the 12‑bit JPEG decoding pipeline, the AVX2 optimizations deliver measurable performance improvements without sacrificing JPEG compliance. The project demonstrates how GitHub Copilot CLI can drive low‑level SIMD development end‑to‑end, from profiling to production‑ready code, entirely from the command line.
