[Paper] Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning

Published: December 9, 2025 at 01:05 PM EST
3 min read

Source: arXiv - 2512.08873v1

Overview

The paper introduces SOLI (Siamese‑Driven Optimization for Low‑Resolution Image Latent Embedding), a lightweight framework that boosts image‑captioning performance on low‑resolution pictures without the heavy computational cost of large transformer encoders. By leveraging a Siamese network to learn richer latent embeddings, SOLI makes it feasible to deploy captioning models on edge devices or in environments with limited GPU memory.

Key Contributions

  • Siamese‑based latent embedding: A dual‑branch network that jointly processes the original low‑resolution image and a super‑resolved counterpart, forcing the encoder to learn resolution‑invariant features.
  • Lightweight architecture: Uses a compact CNN backbone (e.g., MobileNetV2) instead of heavyweight Vision Transformers, cutting inference latency by up to 45 % on a Raspberry Pi 4.
  • Joint optimization loss: Combines a contrastive loss (to align the two branches) with the standard cross‑entropy captioning loss, improving semantic consistency (see the loss sketch after this list).
  • Resource‑aware training pipeline: Introduces a curriculum that gradually increases image resolution during fine‑tuning, allowing models to converge with ≤ 2 GB GPU memory.
  • Comprehensive evaluation: Benchmarks on MS‑COCO‑LR (a low‑resolution subset) and a real‑world assistive‑technology dataset, showing +3.2 CIDEr over baseline CNN‑LSTM models while using ≈ 30 % fewer parameters.
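
The summary describes the joint objective only at a high level, so here is a minimal PyTorch sketch of one way a contrastive alignment term could be combined with the captioning cross‑entropy. The InfoNCE‑style formulation, the `temperature` and `alpha` hyper‑parameters, and the padding index are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def soli_joint_loss(emb_lr, emb_sr, caption_logits, caption_targets,
                    temperature=0.07, alpha=1.0):
    """Illustrative joint objective: contrastive alignment between the two
    Siamese branches plus standard cross-entropy for the caption decoder.

    emb_lr, emb_sr:  (B, D) embeddings from the low-res and super-resolved branches
    caption_logits:  (B, T, V) decoder outputs
    caption_targets: (B, T) ground-truth token ids
    alpha:           weight of the contrastive term (assumed hyper-parameter)
    """
    # InfoNCE-style contrastive loss: matching image pairs sit on the diagonal
    z1 = F.normalize(emb_lr, dim=-1)
    z2 = F.normalize(emb_sr, dim=-1)
    sim = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    contrastive = 0.5 * (F.cross_entropy(sim, labels) +
                         F.cross_entropy(sim.t(), labels))

    # Standard token-level cross-entropy for the generated captions
    caption_ce = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=0,                             # assumes token id 0 is padding
    )
    return caption_ce + alpha * contrastive
```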

Methodology

  1. Dual‑pathway Siamese Encoder

    • Branch A receives the raw low‑resolution image (e.g., 64×64).
    • Branch B receives a super‑resolved version generated on‑the‑fly by a tiny up‑sampling module (e.g., a 2‑layer sub‑pixel CNN).
    • Both branches share the same lightweight CNN weights, ensuring they learn a common representation space (see the encoder sketch after these steps).
  2. Contrastive Alignment

    • A contrastive loss pulls the embeddings of the two branches together for the same image while pushing apart embeddings of different images.
    • This forces the encoder to ignore resolution‑specific noise and focus on high‑level semantics.
  3. Caption Decoder

    • The merged embedding (averaged from the two branches) feeds into a modest LSTM‑based decoder with attention.
    • Standard teacher‑forcing with cross‑entropy loss is applied, followed by a reinforcement‑learning fine‑tuning step that directly optimizes CIDEr.
  4. Training Curriculum

    • Starts with pure low‑resolution inputs, then progressively introduces higher‑resolution super‑resolved images, allowing the network to adapt smoothly without exploding gradients.
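
As a concrete illustration of the dual‑pathway design described above, here is a minimal PyTorch sketch of a shared‑weight Siamese encoder with a tiny sub‑pixel up‑sampler. The class name, embedding dimension, layer widths, and the simple averaging of the two branch embeddings are assumptions based on the description, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class SOLIEncoder(nn.Module):
    """Illustrative dual-pathway Siamese encoder (not the authors' implementation).

    Branch A sees the raw low-resolution image; Branch B sees a super-resolved
    version produced on-the-fly by a small sub-pixel CNN. Both branches run
    through the same backbone instance, so their weights are shared.
    """

    def __init__(self, embed_dim=256, upscale=2):
        super().__init__()
        # Tiny 2-layer sub-pixel up-sampling module for Branch B
        self.upsampler = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3 * upscale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(upscale),       # rearranges channels into a 2x larger image
        )
        # Lightweight shared backbone (MobileNetV2 feature extractor)
        self.backbone = mobilenet_v2(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(1280, embed_dim)  # 1280 = MobileNetV2 output channels

    def encode(self, images):
        feats = self.pool(self.backbone(images)).flatten(1)
        return self.proj(feats)

    def forward(self, lr_images):
        emb_lr = self.encode(lr_images)                  # Branch A: raw low-res input
        emb_sr = self.encode(self.upsampler(lr_images))  # Branch B: super-resolved input
        merged = 0.5 * (emb_lr + emb_sr)                 # averaged embedding fed to the decoder
        return emb_lr, emb_sr, merged
```

During training, `emb_lr` and `emb_sr` would feed the contrastive alignment term sketched earlier, while `merged` goes to the attention‑LSTM decoder.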

Results & Findings

| Model | Params (M) | FLOPs (G) | CIDEr ↑ | BLEU‑4 ↑ |
| --- | --- | --- | --- | --- |
| Baseline CNN‑LSTM (64×64) | 12.4 | 2.1 | 106.5 | 34.2 |
| Vision‑Transformer (large) | 85.0 | 15.8 | 109.8 | 35.1 |
| SOLI (proposed) | 14.1 | 2.4 | 109.7 | 35.0 |

  • Latency on a Raspberry Pi 4: SOLI ≈ 180 ms per image vs. Transformer ≈ 620 ms.
  • Memory footprint during training stayed under 2 GB, enabling fine‑tuning on consumer‑grade GPUs.
  • Qualitative analysis shows SOLI captions retain fine details (e.g., “a red bicycle leaning against a brick wall”) that baseline models often miss on low‑res inputs.

Practical Implications

  • Edge deployment: Developers can embed SOLI into mobile apps, smart cameras, or assistive devices for visually impaired users without needing a cloud backend.
  • Cost‑effective scaling: Companies can run captioning services on cheaper hardware (e.g., commodity CPUs or low‑end GPUs), reducing operational expenses.
  • Robustness to bandwidth constraints: In IoT scenarios where images are transmitted at low resolution to save bandwidth, SOLI can still generate high‑quality descriptions.
  • Plug‑and‑play integration: Because SOLI uses standard CNN and LSTM components, it can be swapped into existing image‑captioning pipelines with minimal code changes.
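
To make the plug‑and‑play point concrete, here is a hypothetical snippet showing the Siamese encoder from the methodology sketch standing in for an existing CNN encoder. The decoder call is left as a commented placeholder because it depends on whatever captioning decoder a pipeline already uses.

```python
import torch

# Hypothetical drop-in usage of the encoder sketched in the Methodology section.
# The downstream decoder is unchanged: it still receives one fixed-size image
# embedding per example.
encoder = SOLIEncoder(embed_dim=256)
lr_batch = torch.randn(8, 3, 64, 64)      # a batch of 64x64 low-resolution images
_, _, merged = encoder(lr_batch)          # only the merged embedding is needed at inference
# captions = existing_decoder.generate(merged)  # `existing_decoder`: whatever attention-LSTM
#                                               # decoder the pipeline already uses (placeholder)
```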

Limitations & Future Work

  • Resolution ceiling: SOLI is tuned for very low resolutions (≤ 64×64); performance gains diminish on higher‑resolution images where heavyweight encoders already excel.
  • Super‑resolution dependency: The on‑the‑fly up‑sampling module adds a small overhead; future work could explore learned embeddings that bypass explicit super‑resolution.
  • Domain generalization: Experiments were limited to COCO‑style scenes and a small assistive‑tech dataset; broader domain testing (e.g., medical imaging) is needed.
  • Multilingual captions: The current decoder is English‑only; extending the framework to multilingual generation is an open avenue.

Overall, SOLI demonstrates that clever architectural choices—specifically a Siamese‑driven latent embedding—can close the performance gap for low‑resolution image captioning while staying lightweight enough for real‑world, resource‑constrained deployments.

Authors

  • Jing Jie Tan
  • Anissa Mokraoui
  • Ban-Hoe Kwan
  • Danny Wee-Kiat Ng
  • Yan-Chai Hum

Paper Information

  • arXiv ID: 2512.08873v1
  • Categories: cs.CV, cs.AI, cs.HC
  • Published: December 9, 2025