[Paper] Addressing Image Authenticity When Cameras Use Generative AI
Source: arXiv - 2604.21879v1
Overview
Recent camera pipelines are beginning to embed generative‑AI (GenAI) modules directly into the image signal processor (ISP). While this can improve low‑light performance or enable AI‑based digital zoom, it also introduces the risk that the final JPEG coming out of the camera contains hallucinated (synthetically added) details. The paper Addressing Image Authenticity When Cameras Use Generative AI proposes a lightweight, post‑capture technique that lets users retrieve the “un‑hallucinated” version of a photo, restoring confidence in what the sensor actually recorded.
Key Contributions
- Self‑contained recovery pipeline – An image‑specific MLP decoder paired with a modality‑specific encoder can reconstruct the pre‑hallucination image using only the final camera output.
- Tiny footprint – The combined encoder/decoder model occupies just ~180 KB, small enough to be stored as metadata inside standard JPEG/HEIC containers.
- No ISP access required – The method works entirely after capture; it does not need any privileged access to the camera’s internal processing chain.
- Generalizable to multiple hallucination modalities – Demonstrated on AI‑enhanced digital zoom, low‑light denoising, and edge‑enhancement pipelines.
- Open‑source implementation & benchmark – The authors release code and a dataset of paired “raw‑before‑hallucination” / “post‑hallucination” images for reproducibility.
Methodology
- Data collection – The authors capture pairs of images: the raw sensor output (ground truth) and the same image after it passes through a GenAI‑augmented ISP.
- Modality‑specific encoder – A shallow convolutional encoder extracts a compact latent code (≈ 64 bytes) that describes the hallucination style applied by the ISP (e.g., low‑light boost, AI zoom).
- Image‑specific MLP decoder – For each input photo, a small multi‑layer perceptron (MLP) is optimized per image to map the latent code and the hallucinated pixel values back to the original sensor values. The optimization runs for a few hundred iterations, taking under 2 seconds on a modern laptop.
- Embedding as metadata – The encoder weights and the learned MLP parameters are serialized and stored in the EXIF block of the output JPEG/HEIC file.
- Recovery – When a user (or downstream application) wants the authentic image, the embedded model is loaded, the MLP is executed on the hallucinated image, and the pre‑hallucination version is reconstructed on‑the‑fly.
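The recovery step above can be sketched in a few lines of numpy. Note that the latent size (8), the hidden widths (64), and the ReLU activation are illustrative assumptions for this sketch, not the paper's exact decoder architecture:

```python
import numpy as np

def recover(pixels, latent, params):
    """Apply the embedded MLP decoder to hallucinated pixel values.

    pixels: (N, 3) RGB values in [0, 1] from the hallucinated image.
    latent: per-image latent code describing the hallucination style.
    params: list of (W, b) layer weights loaded from the image metadata.
    """
    # Condition every pixel on the same per-image latent code.
    z = np.broadcast_to(latent, (pixels.shape[0], latent.shape[0]))
    h = np.concatenate([pixels, z], axis=1)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)        # ReLU on hidden layers only
    return np.clip(h, 0.0, 1.0)           # reconstructed sensor values

# Hypothetical shapes: 8-dim latent, two 64-unit hidden layers (assumptions).
rng = np.random.default_rng(0)
latent = rng.normal(size=8)
params = [
    (rng.normal(0, 0.1, (11, 64)), np.zeros(64)),  # 3 pixel + 8 latent inputs
    (rng.normal(0, 0.1, (64, 64)), np.zeros(64)),
    (rng.normal(0, 0.1, (64, 3)), np.zeros(3)),
]
hallucinated = rng.random((16 * 16, 3))            # flattened 16x16 crop
recovered = recover(hallucinated, latent, params)
print(recovered.shape)                             # (256, 3)
```

In the real pipeline the weights in `params` would be deserialized from the EXIF block rather than sampled randomly; the forward pass itself is this simple.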
The whole pipeline is deliberately kept simple: a few convolutional layers for encoding and a 3‑layer MLP (≈ 150 K parameters) for decoding, making it feasible on mobile CPUs and even microcontrollers.
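As a quick sanity check on the stated footprint (the hidden width of 384 here is an assumption chosen to land near the reported ~150 K parameters; the paper's summary does not list exact layer sizes):

```python
# Back-of-envelope: parameter count and serialized size of a 3-layer MLP.
def mlp_param_count(sizes):
    """Total weights + biases for a fully connected net with the given sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

sizes = [3, 384, 384, 3]           # RGB in -> two hidden layers -> RGB out
n_params = mlp_param_count(sizes)
print(n_params)                    # 150531, close to the stated ~150 K

# Stored as float32 the decoder alone would be ~588 KB; roughly one byte
# per parameter (e.g., int8 quantization) is what brings the combined
# encoder/decoder metadata near the reported ~180 KB.
print(n_params * 4 // 1024, "KB at float32")   # 588 KB
print(n_params * 1 // 1024, "KB at int8")      # 147 KB
```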
Results & Findings
| Hallucination type | PSNR gain (recovered vs. hallucinated) | SSIM improvement | Visual fidelity |
|---|---|---|---|
| AI‑based digital zoom (4×) | +6.8 dB | +0.12 | Restores true object boundaries, removes “ghost” textures |
| Low‑light enhancement | +5.4 dB | +0.09 | Recovers realistic noise pattern, eliminates over‑bright spots |
| Edge‑enhancement | +4.2 dB | +0.07 | Removes oversharpened halos while preserving true edges |
Across all tested modalities, the recovered images are statistically indistinguishable from the ground‑truth raw captures (no significant difference at the 0.01 level). Importantly, the reconstruction runs in under 30 ms on a Snapdragon 8 Gen 2 CPU, confirming real‑time feasibility.
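The PSNR figures in the table follow the standard definition; a minimal numpy sketch (the gain reported above is presumably PSNR(recovered, raw) minus PSNR(hallucinated, raw), each measured against the raw capture):

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-shape images."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

# Toy check: a uniform +1 error gives MSE = 1, so PSNR = 20*log10(255).
ref = np.zeros((8, 8), dtype=np.uint8)
img = ref + 1
print(round(psnr(img, ref), 2))   # 48.13
```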
Practical Implications
- For developers of camera apps – Embedding the encoder/decoder metadata lets you offer a “view original sensor data” toggle without exposing raw sensor files or requiring privileged ISP hooks.
- For forensic and compliance tools – The method provides a verifiable chain of custody: the recovered image can be audited against the hallucinated version, helping regulators detect manipulated content.
- For device manufacturers – Adding ~180 KB of metadata is negligible compared to typical image sizes (2–8 MB), yet it gives end‑users transparency about AI‑driven post‑processing.
- For cloud‑based photo services – The lightweight model can be executed server‑side to automatically flag images that have undergone aggressive AI enhancements, enabling better content moderation.
- For open‑source camera pipelines (e.g., OpenCV, libcamera) – The approach can be integrated as a post‑processing plugin, giving hobbyists the same authenticity guarantees as flagship phones.
Limitations & Future Work
- Per‑image optimization cost – Although fast on modern hardware, the need to train an MLP for each photo adds latency on low‑power devices (e.g., wearables).
- Scope of hallucination types – The paper focuses on three ISP modules; more exotic generative pipelines (e.g., style transfer, background replacement) may require richer encoder representations.
- Assumption of deterministic ISP – If the ISP includes stochastic elements (e.g., random noise injection), exact recovery becomes probabilistic.
- Future directions suggested by the authors include: (1) learning a universal decoder that generalizes across images, eliminating per‑image training; (2) compressing the metadata further via quantization; and (3) extending the framework to video streams where temporal consistency is crucial.
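Direction (2) amounts to standard weight quantization. A minimal symmetric int8 sketch in numpy (the scheme here is a common choice, not necessarily the authors'):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric affine quantization: float32 weights -> int8 plus a scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, 1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.nbytes, "bytes vs", w.nbytes)   # 4x smaller
print(err <= s / 2 + 1e-6)              # rounding error <= half a quant step
```

A 4x size reduction with bounded per-weight error is why quantization is a natural fit for shrinking per-image metadata further.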
Authors
- Umar Masud
- Abhijith Punnappurath
- Luxi Zhao
- David B. Lindell
- Michael S. Brown
Paper Information
- arXiv ID: 2604.21879v1
- Categories: cs.CV, cs.AI
- Published: April 23, 2026