[Paper] Multi-Modal Semantic Communication

Published: December 17, 2025 at 01:47 PM EST
4 min read
Source: arXiv - 2512.15691v1

Overview

The paper introduces a Multi‑Modal Semantic Communication system that lets a receiver reconstruct only the parts of an image that matter for a specific task, guided by a textual query. By fusing visual data with language embeddings through cross‑modal attention, the framework dynamically allocates bandwidth to the most relevant image patches, achieving higher efficiency especially in bandwidth‑limited or complex visual scenes.

Key Contributions

  • Query‑driven relevance scoring: Uses user‑provided text queries to compute soft relevance maps over visual content via cross‑modal attention.
  • Adaptive patch‑level transmission: Selects image patches and assigns them variable resolutions based on relevance scores and real‑time channel capacity.
  • Independent encoder‑decoder pairs per resolution: Trains multiple lightweight auto‑encoders, each specialized for a specific patch resolution, enabling on‑the‑fly switching without re‑training.
  • End‑to‑end semantic pipeline: Integrates query processing, relevance estimation, bitrate budgeting, and reconstruction into a single trainable system.
  • Demonstrated gains in complex scenes: Shows that the method outperforms self‑attention‑only baselines when images contain multiple objects or clutter.

Methodology

  1. Input Processing

    • Visual stream: An image is split into a grid of overlapping patches. Each patch is passed through a CNN backbone to obtain a visual feature vector.
    • Language stream: The user’s textual query (e.g., “find the traffic sign”) is tokenized and embedded using a pretrained transformer (BERT‑style).
  2. Cross‑Modal Attention

    • Visual features serve as keys and values, while the language embedding acts as the query in a standard attention module.
    • The attention scores are normalized to produce a soft relevance map indicating how important each patch is for the given task (a minimal code sketch of this scoring appears after this list).
  3. Adaptive Bitrate Allocation

    • The system knows the instantaneous channel bandwidth (bits per second).
    • An optimization routine (a greedy, knapsack‑like algorithm) selects a subset of patches and assigns each a resolution level (low, medium, or high) so that the total bit count roughly matches the channel capacity while maximizing the summed relevance scores (see the allocation sketch after this list).
  4. Patch Encoding & Transmission

    • Each resolution level has a dedicated encoder‑decoder pair (tiny auto‑encoders).
    • Selected patches are encoded at their assigned resolution and transmitted as separate packets.
  5. Reconstruction at the Receiver

    • Received patches are decoded, placed back into their original spatial locations, and blended (e.g., via weighted averaging) to form the final image.
    • Because high‑relevance patches are sent at higher quality, the reconstructed image retains the information needed for the downstream task (object detection, classification, etc.).
  6. Training

    • The whole pipeline (except the independent encoders) is trained end‑to‑end using a combination of reconstruction loss (pixel‑wise) and task‑specific loss (e.g., classification cross‑entropy) to encourage the relevance scores to align with actual task performance.
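
To make steps 1–2 concrete, here is a minimal sketch of query‑driven relevance scoring. It is an illustration under assumptions, not the authors' code: the single attention head, the feature dimensions, and the class name `CrossModalRelevance` are all hypothetical.

```python
# Minimal sketch of query-driven relevance scoring (steps 1-2).
# Hypothetical dimensions and module names; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalRelevance(nn.Module):
    """Scores image patches against a text query with single-head cross-attention."""

    def __init__(self, visual_dim=512, text_dim=768, attn_dim=256):
        super().__init__()
        self.to_query = nn.Linear(text_dim, attn_dim)   # language embedding -> query
        self.to_key = nn.Linear(visual_dim, attn_dim)   # patch features -> keys

    def forward(self, patch_feats, query_emb):
        # patch_feats: (num_patches, visual_dim) from a CNN backbone
        # query_emb:   (text_dim,) pooled embedding of the textual query
        q = self.to_query(query_emb)                     # (attn_dim,)
        k = self.to_key(patch_feats)                     # (num_patches, attn_dim)
        scores = k @ q / (k.shape[-1] ** 0.5)            # scaled dot-product
        return F.softmax(scores, dim=0)                  # soft relevance map over patches

# Example: 64 patches, one BERT-style query embedding
relevance = CrossModalRelevance()(torch.randn(64, 512), torch.randn(768))
```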
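
Step 3's adaptive bitrate allocation can be approximated with a greedy, knapsack‑style heuristic like the one below. The per‑level bit costs, the upgrade order, and the function name `allocate` are illustrative assumptions rather than the paper's exact routine.

```python
# Illustrative greedy allocation of resolution levels to patches (step 3).
# Bit costs and the relevance-ordered upgrade rule are assumptions, not the paper's routine.

BITS = {"skip": 0, "low": 256, "medium": 1024, "high": 4096}  # bits per patch per level

def allocate(relevance, budget_bits):
    """Greedily upgrade the most relevant patches while the bit budget allows."""
    assignment = {i: "skip" for i in range(len(relevance))}
    spent = 0
    # visit patches from most to least relevant
    for i in sorted(range(len(relevance)), key=lambda idx: -relevance[idx]):
        for level in ["high", "medium", "low"]:          # try the best quality first
            extra = BITS[level] - BITS[assignment[i]]
            if spent + extra <= budget_bits:
                assignment[i] = level
                spent += extra
                break
    return assignment, spent

# Example: 8 patches, a 10 kbit budget for this frame
scores = [0.30, 0.05, 0.02, 0.25, 0.01, 0.20, 0.12, 0.05]
plan, used = allocate(scores, budget_bits=10_000)
print(plan, used)
```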

Results & Findings

| Metric | Baseline (self-attention) | Proposed Multi-Modal System |
| --- | --- | --- |
| Average PSNR (at 0.5 Mbps) | 22.3 dB | 27.8 dB |
| Task accuracy (object detection mAP) | 68 % | 81 % |
| Bandwidth saved (vs. full-image transmission) | ~30 % | ≈55 % |
  • Complex scenes: When images contain 3–5 objects, the relevance map correctly highlights the queried object while suppressing background, leading to a 13 % boost in detection mAP over the baseline.
  • Robustness to bandwidth fluctuations: The adaptive allocation algorithm gracefully degrades quality by lowering resolution of low‑relevance patches, preserving task performance even when capacity drops 40 %.
  • Ablation study: Removing the language query reduces performance to the level of the self‑attention baseline, confirming the importance of explicit task guidance.

Practical Implications

  • AR/VR streaming: Devices can stream only the parts of a scene that a user is looking at or interacting with, dramatically cutting latency and data usage.
  • Remote sensing & UAVs: Band‑constrained drones can prioritize transmitting image regions that match a ground‑station query (e.g., “locate damaged infrastructure”), saving battery and bandwidth.
  • Edge AI services: Edge servers can offload only task‑relevant visual snippets to the cloud, reducing uplink costs while still enabling accurate inference.
  • Telepresence: In video calls, the system could focus bandwidth on faces or objects the speaker mentions, improving perceived quality under limited networks.

Developers can integrate the cross‑modal attention module as a plug‑in to existing vision pipelines, and the independent encoder‑decoder pairs can be swapped for lightweight neural codecs already available in mobile SDKs.
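
As a rough sketch of that integration point, the relevance module would sit between an existing backbone and the transmission layer. All names in the snippet (`backbone`, `embed_query`, `relevance_module`, `allocate_fn`, `send_patches`) are placeholders for components a developer would already have or supply, not APIs from the paper.

```python
# Hypothetical glue code: slotting query-driven relevance into an existing pipeline.
# Every argument is a placeholder component supplied by the caller.

def transmit_frame(image_patches, query_text, budget_bits,
                   backbone, embed_query, relevance_module, allocate_fn, send_patches):
    patch_feats = backbone(image_patches)                  # existing CNN feature extractor
    query_emb = embed_query(query_text)                    # pretrained BERT-style text encoder
    relevance = relevance_module(patch_feats, query_emb)   # soft relevance map over patches
    plan, _ = allocate_fn(list(relevance), budget_bits)    # e.g., the greedy sketch above
    send_patches(image_patches, plan)                      # neural codec per resolution level
```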

Limitations & Future Work

  • Scalability of encoder pool: Training a separate encoder‑decoder for each resolution level can become cumbersome as more granularity is desired. Future work may explore a single conditional codec.
  • Query formulation: The approach assumes well‑formed textual queries; handling ambiguous or noisy language remains an open challenge.
  • Real‑world channel modeling: Experiments used simulated bandwidth; testing on actual wireless links (5G, Wi‑Fi 6E) will be needed to validate robustness.
  • Extension to video: The current framework processes single frames; extending the relevance scoring and adaptive transmission to temporal streams is a natural next step.

Authors

  • Matin Mortaheb
  • Erciyes Karakaya
  • Sennur Ulukus

Paper Information

  • arXiv ID: 2512.15691v1
  • Categories: cs.LG, cs.IT, eess.SP, eess.SY
  • Published: December 17, 2025