[Paper] Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning
Source: arXiv - 2512.08873v1
Overview
The paper introduces SOLI (Siamese‑Driven Optimization for Low‑Resolution Image Latent Embedding), a lightweight framework that boosts image‑captioning performance on low‑resolution images without the heavy computational cost of large transformer encoders. By leveraging a Siamese network to learn richer latent embeddings, SOLI makes it feasible to deploy captioning models on edge devices or in environments with limited GPU memory.
Key Contributions
- Siamese‑based latent embedding: A dual‑branch network that jointly processes the original low‑resolution image and a super‑resolved counterpart, forcing the encoder to learn resolution‑invariant features.
- Lightweight architecture: Uses a compact CNN backbone (e.g., MobileNetV2) instead of heavyweight Vision Transformers, cutting inference latency by up to 45 % on a Raspberry Pi 4.
- Joint optimization loss: Combines a contrastive loss (to align the two branches) with the standard cross‑entropy captioning loss, improving semantic consistency (summarized in the equation after this list).
- Resource‑aware training pipeline: Introduces a curriculum that gradually increases image resolution during fine‑tuning, allowing models to converge with ≤ 2 GB GPU memory.
- Comprehensive evaluation: Benchmarks on MS‑COCO‑LR (a low‑resolution subset) and a real‑world assistive‑technology dataset, showing +3.2 CIDEr over baseline CNN‑LSTM models while using ≈ 30 % fewer parameters.
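The joint objective combines the captioning and alignment terms as a weighted sum. A minimal formulation is given below; the weighting factor λ is an illustrative assumption, since the summary does not state the exact form:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{contrastive}}
$$

where $\mathcal{L}_{\text{CE}}$ is the teacher‑forced cross‑entropy captioning loss and $\mathcal{L}_{\text{contrastive}}$ pulls the embeddings of the two Siamese branches together.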
Methodology
Dual‑pathway Siamese Encoder
- Branch A receives the raw low‑resolution image (e.g., 64×64).
- Branch B receives a super‑resolved version generated on‑the‑fly by a tiny up‑sampling module (e.g., a 2‑layer sub‑pixel CNN).
- Both branches share the same lightweight CNN weights, ensuring they learn a common representation space.
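A minimal PyTorch sketch of this dual‑pathway design follows. It assumes a MobileNetV2 backbone, a 512‑dimensional embedding, and a 2× sub‑pixel upsampler; these specifics are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the dual-pathway Siamese encoder (assumed layer sizes, not the paper's exact config).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2


class SubPixelUpsampler(nn.Module):
    """Tiny 2-layer sub-pixel CNN that super-resolves the input on the fly (x2)."""

    def __init__(self, channels: int = 3, scale: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels * scale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a higher-resolution image
        )

    def forward(self, x):
        return self.net(x)


class SiameseEncoder(nn.Module):
    """Two branches (raw low-res / super-resolved) sharing one lightweight CNN."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.upsampler = SubPixelUpsampler()
        self.backbone = mobilenet_v2(weights=None).features  # shared weights for both branches
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(1280, embed_dim)  # 1280 = MobileNetV2 feature channels

    def encode(self, x):
        feats = self.pool(self.backbone(x)).flatten(1)
        return self.proj(feats)

    def forward(self, low_res):
        z_a = self.encode(low_res)                  # Branch A: raw low-resolution input
        z_b = self.encode(self.upsampler(low_res))  # Branch B: super-resolved input
        return z_a, z_b
```

Both branches call the same `encode` method, so the raw and super‑resolved inputs pass through literally identical weights; this weight sharing is what pushes the learned features toward resolution invariance.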
Contrastive Alignment
- A contrastive loss pulls the embeddings of the two branches together for the same image while pushing apart embeddings of different images.
- This forces the encoder to ignore resolution‑specific noise and focus on high‑level semantics.
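A common way to instantiate such an objective is a batch‑wise InfoNCE‑style loss. The sketch below assumes this variant and a fixed temperature; the summary does not specify the exact contrastive formulation used in the paper.

```python
# Sketch of a batch-wise contrastive (InfoNCE-style) alignment loss between the
# two branch embeddings. Temperature and loss variant are assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Pull matching (low-res, super-resolved) pairs together, push other pairs apart."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric cross-entropy: each branch must identify its own counterpart in the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```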
Caption Decoder
- The merged embedding (averaged from the two branches) feeds into a modest LSTM‑based decoder with attention.
- Standard teacher‑forcing and cross‑entropy loss are applied, followed by a reinforcement‑learning fine‑tuning step that directly optimizes CIDEr.
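The simplified training step below shows how the merged embedding, the teacher‑forced cross‑entropy loss, and the contrastive term could fit together. It reuses `SiameseEncoder` and `contrastive_alignment_loss` from the sketches above, omits the attention mechanism and the CIDEr RL stage for brevity, and treats the decoder layout and the loss weight `alpha` as assumptions.

```python
# Sketch of one teacher-forced training step combining the captioning cross-entropy
# loss with the contrastive alignment term. Decoder layout and `alpha` are assumptions.
import torch
import torch.nn as nn


class LSTMCaptionDecoder(nn.Module):
    """Modest LSTM decoder conditioned on the merged image embedding (attention omitted)."""

    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden_dim: int = 512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_embed, captions):
        # Teacher forcing: concatenate the image embedding with each input token embedding.
        tokens = self.word_embed(captions[:, :-1])
        img = image_embed.unsqueeze(1).expand(-1, tokens.size(1), -1)
        hidden, _ = self.lstm(torch.cat([tokens, img], dim=-1))
        return self.out(hidden)  # logits over the vocabulary


def training_step(encoder, decoder, images, captions, alpha: float = 0.5):
    z_a, z_b = encoder(images)                      # dual-branch embeddings
    merged = 0.5 * (z_a + z_b)                      # averaged latent embedding
    logits = decoder(merged, captions)
    ce = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    return ce + alpha * contrastive_alignment_loss(z_a, z_b)
```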
Training Curriculum
- Starts with pure low‑resolution inputs, then progressively introduces higher‑resolution super‑resolved images, allowing the network to adapt smoothly without exploding gradients.
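One simple way to realize such a curriculum is a per‑epoch resolution schedule; the ramp shape and resolution steps below are assumptions for illustration.

```python
# Sketch of a resolution curriculum: training starts from purely low-resolution
# inputs and gradually mixes in higher-resolution targets. Schedule shape is assumed.
def curriculum_resolution(epoch: int, total_epochs: int,
                          start: int = 64, end: int = 128) -> int:
    """Return the target training resolution for the given epoch."""
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    # Linear ramp from `start` to `end`, snapped to multiples of 32.
    res = start + progress * (end - start)
    return int(round(res / 32) * 32)


# Example: with 10 epochs, early epochs train at 64x64 and the last ones at 128x128.
schedule = [curriculum_resolution(e, 10) for e in range(10)]
```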
Results & Findings
| Model | Params (M) | FLOPs (G) | CIDEr ↑ | BLEU‑4 ↑ |
|---|---|---|---|---|
| Baseline CNN‑LSTM (64×64) | 12.4 | 2.1 | 106.5 | 34.2 |
| Vision‑Transformer (large) | 85.0 | 15.8 | 109.8 | 35.1 |
| SOLI (proposed) | 14.1 | 2.4 | 109.7 | 35.0 |
- Latency on a Raspberry Pi 4: SOLI ≈ 180 ms per image vs. Transformer ≈ 620 ms.
- Memory footprint during training stayed under 2 GB, enabling fine‑tuning on consumer‑grade GPUs.
- Qualitative analysis shows SOLI captions retain fine details (e.g., “a red bicycle leaning against a brick wall”) that baseline models often miss on low‑res inputs.
Practical Implications
- Edge deployment: Developers can embed SOLI into mobile apps, smart cameras, or assistive devices for visually impaired users without needing a cloud backend.
- Cost‑effective scaling: Companies can run captioning services on cheaper hardware (e.g., commodity CPUs or low‑end GPUs), reducing operational expenses.
- Robustness to bandwidth constraints: In IoT scenarios where images are transmitted at low resolution to save bandwidth, SOLI can still generate high‑quality descriptions.
- Plug‑and‑play integration: Because SOLI uses standard CNN and LSTM components, it can be swapped into existing image‑captioning pipelines with minimal code changes.
Limitations & Future Work
- Resolution ceiling: SOLI is tuned for very low resolutions (≤ 64×64); performance gains diminish on higher‑resolution images where heavyweight encoders already excel.
- Super‑resolution dependency: The on‑the‑fly up‑sampling module adds a small overhead; future work could explore learned embeddings that bypass explicit super‑resolution.
- Domain generalization: Experiments were limited to COCO‑style scenes and a small assistive‑tech dataset; broader domain testing (e.g., medical imaging) is needed.
- Multilingual captions: The current decoder is English‑only; extending the framework to multilingual generation is an open avenue.
Overall, SOLI demonstrates that clever architectural choices—specifically a Siamese‑driven latent embedding—can close the performance gap for low‑resolution image captioning while staying lightweight enough for real‑world, resource‑constrained deployments.
Authors
- Jing Jie Tan
- Anissa Mokraoui
- Ban-Hoe Kwan
- Danny Wee-Kiat Ng
- Yan-Chai Hum
Paper Information
- arXiv ID: 2512.08873v1
- Categories: cs.CV, cs.AI, cs.HC
- Published: December 9, 2025