[Paper] Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning
Source: arXiv - 2512.08873v1
Overview
The paper introduces SOLI (Siamese‑Driven Optimization for Low‑Resolution Image Latent Embedding), a lightweight framework that boosts image‑captioning performance on low‑resolution images without the heavy computational cost of large transformer encoders. By leveraging a Siamese network to learn richer latent embeddings, SOLI makes it feasible to deploy captioning models on edge devices or in environments with limited GPU memory.
Key Contributions
- Siamese‑based latent embedding: A dual‑branch network that jointly processes the original low‑resolution image and a super‑resolved counterpart, forcing the encoder to learn resolution‑invariant features.
- Lightweight architecture: Uses a compact CNN backbone (e.g., MobileNetV2) instead of heavyweight Vision Transformers, cutting inference latency by up to 45 % on a Raspberry Pi 4.
- Joint optimization loss: Combines a contrastive loss (to align the two branches) with the standard cross‑entropy captioning loss, improving semantic consistency (summarized in the equation after this list).
- Resource‑aware training pipeline: Introduces a curriculum that gradually increases image resolution during fine‑tuning, allowing models to converge with ≤ 2 GB GPU memory.
- Comprehensive evaluation: Benchmarks on MS‑COCO‑LR (a low‑resolution subset) and a real‑world assistive‑technology dataset, showing +3.2 CIDEr over baseline CNN‑LSTM models while using ≈ 30 % fewer parameters.
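The joint objective combines the captioning and alignment terms as a weighted sum. A minimal formulation is given below; the weighting factor λ is an illustrative assumption, since the summary does not state the exact form:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{contrastive}}
$$

where $\mathcal{L}_{\text{CE}}$ is the teacher‑forced cross‑entropy captioning loss and $\mathcal{L}_{\text{contrastive}}$ pulls the embeddings of the two Siamese branches together.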
Methodology
Dual‑pathway Siamese Encoder
- Branch A receives the raw low‑resolution image (e.g., 64×64).
- Branch B receives a super‑resolved version generated on‑the‑fly by a tiny up‑sampling module (e.g., a 2‑layer sub‑pixel CNN).
- Both branches share the same lightweight CNN weights, ensuring they learn a common representation space.
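A minimal PyTorch sketch of this dual‑pathway design follows. It assumes a MobileNetV2 backbone, a 512‑dimensional embedding, and a 2× sub‑pixel upsampler; these specifics are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the dual-pathway Siamese encoder (assumed layer sizes, not the paper's exact config).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2


class SubPixelUpsampler(nn.Module):
    """Tiny 2-layer sub-pixel CNN that super-resolves the input on the fly (x2)."""

    def __init__(self, channels: int = 3, scale: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels * scale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a higher-resolution image
        )

    def forward(self, x):
        return self.net(x)


class SiameseEncoder(nn.Module):
    """Two branches (raw low-res / super-resolved) sharing one lightweight CNN."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.upsampler = SubPixelUpsampler()
        self.backbone = mobilenet_v2(weights=None).features  # shared weights for both branches
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(1280, embed_dim)  # 1280 = MobileNetV2 feature channels

    def encode(self, x):
        feats = self.pool(self.backbone(x)).flatten(1)
        return self.proj(feats)

    def forward(self, low_res):
        z_a = self.encode(low_res)                  # Branch A: raw low-resolution input
        z_b = self.encode(self.upsampler(low_res))  # Branch B: super-resolved input
        return z_a, z_b
```

Both branches call the same `encode` method, so the raw and super‑resolved inputs pass through literally identical weights; this weight sharing is what pushes the learned features toward resolution invariance.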
Contrastive Alignment
- A contrastive loss pulls the embeddings of the two branches together for the same image while pushing apart embeddings of different images.
- This forces the encoder to ignore resolution‑specific noise and focus on high‑level semantics.
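A common way to instantiate such an objective is a batch‑wise InfoNCE‑style loss. The sketch below assumes this variant and a fixed temperature; the summary does not specify the exact contrastive formulation used in the paper.

```python
# Sketch of a batch-wise contrastive (InfoNCE-style) alignment loss between the
# two branch embeddings. Temperature and loss variant are assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Pull matching (low-res, super-resolved) pairs together, push other pairs apart."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-entropy: each branch must identify its own counterpart in the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```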
Caption Decoder
- The merged embedding (averaged from the two branches) feeds into a modest LSTM‑based decoder with attention.
- Standard teacher‑forcing and cross‑entropy loss are applied, followed by a reinforcement‑learning fine‑tuning step that directly optimizes CIDEr.
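The simplified training step below shows how the merged embedding, the teacher‑forced cross‑entropy loss, and the contrastive term could fit together. It reuses `SiameseEncoder` and `contrastive_alignment_loss` from the sketches above, omits the attention mechanism and the CIDEr RL stage for brevity, and treats the decoder layout and the loss weight `alpha` as assumptions.

```python
# Sketch of one teacher-forced training step combining the captioning cross-entropy
# loss with the contrastive alignment term. Decoder layout and `alpha` are assumptions.
import torch
import torch.nn as nn


class LSTMCaptionDecoder(nn.Module):
    """Modest LSTM decoder conditioned on the merged image embedding (attention omitted)."""

    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_embed, captions):
        # Teacher forcing: concatenate the image embedding with each input token embedding.
        tokens = self.word_embed(captions[:, :-1])
        img = image_embed.unsqueeze(1).expand(-1, tokens.size(1), -1)
        hidden, _ = self.lstm(torch.cat([tokens, img], dim=-1))
        return self.out(hidden)  # logits over the vocabulary


def training_step(encoder, decoder, images, captions, alpha: float = 0.5):
    z_a, z_b = encoder(images)                      # dual-branch embeddings
    merged = 0.5 * (z_a + z_b)                      # averaged latent embedding
    logits = decoder(merged, captions)
    ce = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    return ce + alpha * contrastive_alignment_loss(z_a, z_b)
```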
Training Curriculum
- Starts with pure low‑resolution inputs, then progressively introduces higher‑resolution super‑resolved images, allowing the network to adapt smoothly without exploding gradients.
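One simple way to realize such a curriculum is a per‑epoch resolution schedule; the ramp shape and resolution steps below are assumptions for illustration.

```python
# Sketch of a resolution curriculum: training starts from purely low-resolution
# inputs and gradually mixes in higher-resolution targets. Schedule shape is assumed.
def curriculum_resolution(epoch: int, total_epochs: int,
                          start: int = 64, end: int = 128) -> int:
    """Return the target training resolution for the given epoch."""
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    # Linear ramp from `start` to `end`, snapped to multiples of 32.
    res = start + progress * (end - start)
    return int(round(res / 32) * 32)


# Example: with 10 epochs, early epochs train at 64x64 and the last ones at 128x128.
schedule = [curriculum_resolution(e, 10) for e in range(10)]
```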
Results & Findings
| Model | Params (M) | FLOPs (G) | CIDEr ↑ | BLEU‑4 ↑ |
|---|---|---|---|---|
| Baseline CNN‑LSTM (64×64) | 12.4 | 2.1 | 106.5 | 34.2 |
| Vision‑Transformer (large) | 85.0 | 15.8 | 109.8 | 35.1 |
| SOLI (proposed) | 14.1 | 2.4 | 109.7 | 35.0 |
- Latency on a Raspberry Pi 4: SOLI ≈ 180 ms per image vs. Transformer ≈ 620 ms.
- Memory footprint during training stayed under 2 GB, enabling fine‑tuning on consumer‑grade GPUs.
- Qualitative analysis shows SOLI captions retain fine details (e.g., “a red bicycle leaning against a brick wall”) that baseline models often miss on low‑res inputs.
Practical Implications
- Edge deployment: Developers can embed SOLI into mobile apps, smart cameras, or assistive devices for visually impaired users without needing a cloud backend.
- Cost‑effective scaling: Companies can run captioning services on cheaper hardware (e.g., commodity CPUs or low‑end GPUs), reducing operational expenses.
- Robustness to bandwidth constraints: In IoT scenarios where images are transmitted at low resolution to save bandwidth, SOLI can still generate high‑quality descriptions.
- Plug‑and‑play integration: Because SOLI uses standard CNN and LSTM components, it can be swapped into existing image‑captioning pipelines with minimal code changes.
Limitations & Future Work
- Resolution ceiling: SOLI is tuned for very low resolutions (≤ 64×64); performance gains diminish on higher‑resolution images where heavyweight encoders already excel.
- Super‑resolution dependency: The on‑the‑fly up‑sampling module adds a small overhead; future work could explore learned embeddings that bypass explicit super‑resolution.
- Domain generalization: Experiments were limited to COCO‑style scenes and a small assistive‑tech dataset; broader domain testing (e.g., medical imaging) is needed.
- Multilingual captions: The current decoder is English‑only; extending the framework to multilingual generation is an open avenue.
Overall, SOLI demonstrates that clever architectural choices—specifically a Siamese‑driven latent embedding—can close the performance gap for low‑resolution image captioning while staying lightweight enough for real‑world, resource‑constrained deployments.
Authors
- Jing Jie Tan
- Anissa Mokraoui
- Ban-Hoe Kwan
- Danny Wee-Kiat Ng
- Yan-Chai Hum
Paper Information
- arXiv ID: 2512.08873v1
- Categories: cs.CV, cs.AI, cs.HC
- Published: December 9, 2025