[Paper] Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge

Published: February 18, 2026 at 01:05 PM EST
5 min read
Source: arXiv - 2602.16664v1

Overview

The paper introduces Self‑Supervised Semantic Bridge (SSB), a new way to translate images between domains without needing paired examples or explicit adversarial training on the target domain. By injecting a geometry‑preserving semantic representation into diffusion‑based models, SSB achieves high‑fidelity, spatially consistent translations that work even on data the model has never seen—an especially valuable property for medical imaging and text‑guided editing.

Key Contributions

  • Semantic bridge architecture that couples a self‑supervised visual encoder with diffusion “bridge” models, creating a shared latent space that is invariant to appearance but sensitive to structure.
  • Elimination of cross‑domain adversarial loss, allowing the model to generalize to unseen target domains and reducing the need for costly domain‑specific discriminators.
  • Improved inversion quality: the semantic latent conditions guide the diffusion process, mitigating the blur and artifacts typical of diffusion‑inversion pipelines.
  • Strong empirical results on challenging medical image synthesis tasks (e.g., MRI ↔ CT, pathology slides) showing superior performance both in‑domain and out‑of‑domain.
  • Straightforward extension to text‑guided editing, demonstrating that the same bridge can be steered by natural‑language prompts without retraining.

Methodology

  1. Self‑supervised encoder – A convolutional (or vision‑transformer) encoder is trained with a contrastive loss (e.g., SimCLR, MoCo) on a large collection of images. The encoder learns to map any image to a semantic vector that stays stable when colors, textures, or lighting change, but captures the underlying layout and shapes.

  2. Diffusion bridge – Two diffusion models are trained: one that maps a source image to a latent noise space, and another that reconstructs the target image from that noise. Unlike classic diffusion‑inversion, the bridge is conditioned on the semantic vector from step 1.

  3. Training without target‑domain adversaries – The only supervision needed is the self‑supervised semantic loss; the diffusion models are trained to denoise conditioned on the semantic code. This removes the need for a GAN‑style discriminator that must see target‑domain samples.

  4. Inference – To translate an image, the pipeline (a) encodes it to obtain its semantic code, (b) runs the forward diffusion to obtain a noisy latent, and (c) runs the reverse diffusion conditioned on the same semantic code (or a modified one, e.g., derived from a text prompt) to generate the target‑domain image.
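The contrastive objective in step 1 can be sketched with a minimal NT‑Xent loss in the style of SimCLR. This is a toy NumPy illustration of the general technique, not the paper's implementation; the function name and batch layout are our own:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent (SimCLR-style) contrastive loss over a batch.

    z1, z2: (N, D) embeddings of two augmented views of the same images.
    Positive pairs are (z1[i], z2[i]); every other sample in the batch
    acts as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / temperature                        # (2N, 2N) similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    # Row i's positive is row i+n, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimizing this loss pulls the two views of each image together in embedding space, which is what makes the resulting semantic code appearance‑invariant but structure‑sensitive.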

The overall pipeline can be visualized as a “bridge” that carries the geometry of the source image across domains while letting the diffusion model fill in the appropriate appearance.
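The four steps above can be sketched as a single inference function. The decomposition into three callables mirrors the description in the methodology; the function names and the toy stand‑ins are illustrative placeholders, not the authors' API:

```python
import numpy as np

def translate(image, encoder, forward_diffuse, reverse_denoise, semantic_code=None):
    """Hypothetical SSB inference loop.

    (a) Encode the source image to its semantic code (or accept an
        externally supplied one, e.g. a text-derived embedding).
    (b) Run forward diffusion to reach a noisy latent.
    (c) Run reverse diffusion conditioned on the semantic code.
    """
    code = semantic_code if semantic_code is not None else encoder(image)
    noisy_latent = forward_diffuse(image)
    return reverse_denoise(noisy_latent, code)

# Toy stand-ins, just to show the data flow:
img = np.ones((4, 4))
out = translate(
    img,
    encoder=lambda x: x.mean(),                    # "semantic code" = scalar summary
    forward_diffuse=lambda x: x + 0.1,             # pretend noising
    reverse_denoise=lambda z, c: z - 0.1 + 0 * c,  # pretend conditional denoising
)
```

Passing a different `semantic_code` at call time is exactly the hook the text‑guided editing extension exploits.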

Results & Findings

| Task | Metric | SSB vs. Best Prior |
| --- | --- | --- |
| MRI → CT (in‑domain) | SSIM (higher is better) | 0.92 vs. 0.84 (GAN) |
| Histopathology style transfer (out‑of‑domain) | FID (lower is better) | 12.3 vs. 23.7 (Diffusion‑Inversion) |
| Text‑guided facial editing | User‑study preference | 78 % chose SSB outputs |

  • Spatial fidelity: Edge preservation and organ shape consistency were markedly higher than baseline methods, confirmed by both quantitative (SSIM, Dice) and radiologist visual assessments.
  • Generalization: When the model was tested on a completely new imaging modality (e.g., PET scans) without any fine‑tuning, performance degraded only modestly, demonstrating the robustness of the semantic bridge.
  • Speed: Because the semantic encoder is lightweight and the diffusion steps are shared across domains, inference time is comparable to state‑of‑the‑art diffusion‑inversion (≈ 1 s per 256×256 image on an RTX 3090).
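For reference, the SSIM metric reported in the table can be computed in its simplest global, single‑window form. Real evaluations (including, presumably, the paper's) use local sliding windows, so treat this as a didactic sketch of the formula only:

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Global (single-window) SSIM between two images in [0, data_range].

    The constants C1 = (0.01 * L)^2 and C2 = (0.03 * L)^2 follow the
    standard SSIM definition, with L the dynamic range.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()   # cross-covariance
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2)
    )
```

Identical images score 1.0; structurally dissimilar images score near (or below) zero, which is why SSIM is a natural proxy for the spatial fidelity the paper emphasizes.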

Practical Implications

  • Medical imaging pipelines can now synthesize missing modalities (e.g., generate CT from MRI) without collecting paired datasets, reducing patient exposure and acquisition costs.
  • Developers building cross‑domain style transfer tools (e.g., turning sketches into realistic renders) can leverage SSB to avoid training a separate GAN for each target style.
  • Text‑to‑image editors gain a plug‑and‑play conditioning mechanism: swapping the semantic code with a text‑derived embedding yields controllable edits without retraining the diffusion model.
  • Deployment friendliness – Since the approach does not rely on adversarial training, it sidesteps stability issues and can be fine‑tuned on modest hardware, making it attractive for startups and research labs alike.
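The plug‑and‑play conditioning described above amounts to replacing, or blending, the image‑derived semantic code with a text‑derived embedding before the reverse diffusion pass. A minimal sketch of such a blend follows; the linear interpolation plus renormalization is our assumption, not the paper's stated steering rule:

```python
import numpy as np

def steer_code(image_code, text_code, strength=0.5):
    """Blend an image-derived semantic code with a text-derived embedding.

    strength=0 keeps the original code; strength=1 follows the text fully.
    The blend-then-renormalize scheme is a hypothetical steering rule.
    """
    mixed = (1.0 - strength) * image_code + strength * text_code
    return mixed / np.linalg.norm(mixed)
```

The steered code would then be passed to the reverse diffusion step in place of the original one, with no retraining of the diffusion model.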

Limitations & Future Work

  • The quality of the semantic bridge hinges on the self‑supervised encoder; if the pretraining data lacks certain structures (e.g., rare anatomical anomalies), the model may struggle to preserve them.
  • While the method reduces reliance on target‑domain data, it still requires a reasonably large corpus of source‑domain images for encoder pretraining.
  • The current diffusion backbone operates at moderate resolutions (≤ 256 px); scaling to ultra‑high‑resolution medical scans will need memory‑efficient diffusion variants.
  • Future directions include jointly learning the encoder and diffusion bridge (instead of a two‑stage pipeline) and exploring multimodal semantic codes that combine text, segmentation masks, or clinical metadata for richer conditional control.

Authors

  • Jiaming Liu
  • Felix Petersen
  • Yunhe Gao
  • Yabin Zhang
  • Hyojin Kim
  • Akshay S. Chaudhari
  • Yu Sun
  • Stefano Ermon
  • Sergios Gatidis

Paper Information

  • arXiv ID: 2602.16664v1
  • Categories: cs.CV
  • Published: February 18, 2026