[Paper] Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach
Source: arXiv - 2602.10079v1
Overview
The paper presents Forensim, a unified deep‑learning model that detects and localizes both splicing (inserting a foreign object) and copy‑move (duplicating a region within the same image) forgeries. By jointly identifying the source (where the duplicated content came from) and the target (where it was pasted), the system provides richer context than traditional detectors that only flag “tampered” pixels.
Key Contributions
- Unified three‑class segmentation (pristine / source / target) that works for both splicing and copy‑move attacks.
- Attention‑based visual state‑space formulation that turns normalized attention maps into a similarity search across the whole image.
- Region‑based block attention module that refines the coarse similarity map into precise manipulated boundaries.
- End‑to‑end trainable architecture – no separate feature‑extraction, similarity‑matching, or post‑processing stages.
- CMFD‑Anything dataset: a large, diverse collection of copy‑move forgeries that overcomes the limited realism of prior benchmarks.
- State‑of‑the‑art results on standard splicing and copy‑move datasets, with notable gains in source‑region localization accuracy.
Methodology
- Backbone encoder – a standard CNN (e.g., ResNet‑50) extracts a dense feature map from the input image.
- Normalized attention maps – each spatial location attends to every other location via a softmax‑scaled similarity matrix, effectively building a visual state‑space where each “state” is a feature vector.
- Visual state‑space module – the attention matrix is normalized and thresholded to highlight pairs of regions that are unusually similar, a hallmark of copy‑move duplication.
- Block attention module – the image is divided into overlapping blocks; attention scores are aggregated per block, allowing the network to differentiate between genuine repeated patterns (e.g., textures) and malicious duplication.
- Three‑class decoder – a lightweight upsampling head predicts a pixel‑wise mask with three labels: pristine, source, and target. The loss combines cross‑entropy with a boundary‑aware term to sharpen edges.
- Training – the model is trained on a mix of splicing and copy‑move examples (including the new CMFD‑Anything data) using standard stochastic gradient descent, requiring only image–mask pairs.
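The normalized‑attention similarity search in the visual state‑space module can be illustrated with a small sketch. This is not the authors' exact formulation; the function name, threshold, and neighbourhood‑exclusion radius are assumptions chosen for clarity. Each location's features are cosine‑compared against every other location, softmax‑normalized, and nearby (trivially similar) locations are suppressed, so that a high remaining score hints at a distant duplicate:

```python
import numpy as np

def self_similarity_map(features, tau=0.9, exclude_radius=2):
    """Illustrative normalized-attention similarity search
    (hypothetical sketch, not the paper's exact formulation).

    features: (H, W, C) dense feature map from the backbone encoder.
    Returns a per-pixel score: the strongest match each location has
    with any *distant* location. High scores hint at duplication.
    """
    H, W, C = features.shape
    flat = features.reshape(H * W, C)
    # L2-normalize so the dot product is a cosine similarity in [-1, 1].
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    sim = flat @ flat.T  # (HW, HW) attention logits

    # Softmax-normalize each row into an attention distribution.
    z = sim - sim.max(axis=1, keepdims=True)
    attn = np.exp(z)
    attn = attn / attn.sum(axis=1, keepdims=True)

    # Suppress each location's own neighbourhood so trivial
    # self-matches do not dominate the search.
    ys, xs = np.divmod(np.arange(H * W), W)
    dist = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    attn[dist <= exclude_radius] = 0.0

    # Best distant match per pixel; threshold to get candidate regions.
    best = attn.max(axis=1).reshape(H, W)
    return best, best > tau * best.max()
```

Planting an exact copy of a patch elsewhere in a random feature map makes both the source and the pasted region stand out in the returned score map, mimicking the source/target cue the model exploits.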
The whole pipeline runs in a single forward pass, making it suitable for real‑time or batch processing pipelines.
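The three‑class loss can likewise be sketched in a few lines. The paper combines cross‑entropy with a boundary‑aware term; the weighting scheme below (up‑weighting pixels whose label differs from a 4‑neighbour) is one plausible instantiation, and the function name and default weight are assumptions:

```python
import numpy as np

def boundary_weighted_ce(logits, labels, boundary_weight=2.0):
    """Hypothetical three-class (pristine / source / target) loss with
    a simple boundary-aware term; the paper's exact loss may differ.

    logits: (H, W, 3) raw class scores; labels: (H, W) ints in {0, 1, 2}.
    Pixels whose label differs from a 4-neighbour are up-weighted,
    encouraging the decoder to sharpen mask edges.
    """
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Per-pixel negative log-likelihood of the true class.
    nll = -np.take_along_axis(logp, labels[..., None], axis=-1)[..., 0]

    # Boundary map: a pixel is "boundary" if any 4-neighbour disagrees.
    pad = np.pad(labels, 1, mode="edge")
    c = pad[1:-1, 1:-1]
    boundary = (
        (c != pad[:-2, 1:-1]) | (c != pad[2:, 1:-1])
        | (c != pad[1:-1, :-2]) | (c != pad[1:-1, 2:])
    )
    weights = np.where(boundary, boundary_weight, 1.0)
    return float((weights * nll).mean())
```

Confident, correct predictions drive the loss toward zero, while errors on mask edges cost more than errors in region interiors.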
Results & Findings
| Dataset | Metric (Target IoU) | Metric (Source IoU) | Relative Gain vs. Prior SOTA |
|---|---|---|---|
| CASIA‑V2 (splicing) | 0.84 | – | +5 % |
| CoMoFoD (copy‑move) | 0.78 | 0.71 | +7 % (target) / +9 % (source) |
| CMFD‑Anything (new) | 0.81 | 0.73 | – (baseline) |
- The model consistently outperforms separate splicing‑only and copy‑move‑only detectors, especially on the source region, confirming the benefit of joint learning.
- Qualitative examples show clear separation of duplicated objects from their origins, even when the duplicated region undergoes slight geometric transformations (rotation, scaling).
- Ablation studies indicate that removing the block‑attention module drops source IoU by ~6 %, highlighting its role in suppressing false positives from natural repetitions.
Practical Implications
- Content‑moderation pipelines can now flag not just “this image is manipulated” but also where the manipulation originated, aiding fact‑checkers and journalists in reconstructing the narrative.
- Digital forensics tools can automate the tedious manual step of locating the source region, saving hours of analyst time.
- Social‑media platforms can integrate Forensim as a lightweight micro‑service (≈ 30 ms per 512×512 image on a modern GPU) to screen user‑generated content in near‑real time.
- Security‑aware ML systems (e.g., deep‑fake detection) can benefit from the same attention‑state‑space idea to detect subtle copy‑move attacks in video frames.
- The released CMFD‑Anything dataset provides a realistic benchmark for developers building their own forgery detectors, encouraging reproducibility and further innovation.
Limitations & Future Work
- The current model assumes a single source‑target pair; complex forgeries involving multiple duplicated regions may need hierarchical extensions.
- Performance degrades on very high‑resolution images (above 4K) due to memory constraints of the full‑image attention matrix; approximate or hierarchical attention could alleviate this.
- The authors note that adversarial post‑processing (e.g., strong JPEG compression, aggressive noise) can weaken the similarity cues, suggesting a line of work on robustness to compression artifacts.
- Future research directions include extending the state‑space formulation to video (temporal copy‑move) and integrating semantic priors (e.g., object detectors) to further reduce false positives on naturally repetitive textures.
Authors
- Soumyaroop Nandi
- Prem Natarajan
Paper Information
- arXiv ID: 2602.10079v1
- Categories: cs.CV
- Published: February 10, 2026