[Paper] Multi-Modal Style Transfer-based Prompt Tuning for Efficient Federated Domain Generalization

Published: January 9, 2026 at 12:14 PM EST
3 min read
Source: arXiv - 2601.05955v1

Overview

A new federated learning framework called FaST‑PT tackles the long‑standing problem of domain shift when many edge devices (or “clients”) collaboratively train a model that must work on unseen data sources. By marrying lightweight multi‑modal style transfer with a clever prompt‑tuning scheme, the authors dramatically cut communication costs while still achieving state‑of‑the‑art generalization across domains.

Key Contributions

  • Multi‑Modal Style Transfer (MST) – a tiny, text‑guided image‑embedding augmentation that expands the effective training distribution without transmitting extra images.
  • Dual‑Prompt Architecture – separates prompts into a global component (learned from all clients) and a domain component (capturing client‑specific quirks).
  • Domain‑aware Prompt Generation (DPG) – a runtime module that selects the right mix of global and domain prompts per sample, enabling on‑the‑fly adaptation to new, unseen domains.
  • Efficiency Gains – the whole pipeline converges in far fewer communication rounds and with a lower compute footprint than prior FDG methods (e.g., FedDG‑GA, DiPrompt).
  • Extensive Validation – experiments on four cross‑domain benchmarks (PACS, Office‑Home, VLCS, and DomainNet) show consistent accuracy improvements, and ablation studies confirm each design choice.

Methodology

  1. Local Feature Augmentation via MST

    • Each client extracts image embeddings from a frozen vision‑language backbone (e.g., CLIP).
    • A lightweight style‑transfer network, conditioned on textual descriptions (e.g., “photo”, “sketch”), perturbs these embeddings to mimic the visual style of other domains.
    • Because only embeddings (not raw pixels) are exchanged, bandwidth usage stays minimal.
  2. Prompt Decomposition

    • Global Prompt: learned centrally from the aggregated, style‑augmented embeddings; encodes knowledge that should hold across any domain.
    • Domain Prompt: kept locally; captures nuances of the client’s own data distribution (camera type, lighting, etc.).
  3. Domain‑aware Prompt Generation (DPG)

    • For each incoming sample, DPG predicts a weighting vector that blends the global and domain prompts.
    • The blended prompt is then injected into the downstream classifier (or decoder), effectively “personalizing” the inference step without extra model parameters (a minimal sketch of this blending follows the list).
  4. Training Loop

    • Clients perform a few local SGD steps on their augmented embeddings and domain prompts.
    • Only the global prompt and a tiny MST parameter set are uploaded to the server each round.
    • The server averages the global prompts (standard federated averaging) and redistributes the updated version (a schematic sketch of one training round also follows this list).
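
To make the training loop concrete, here is a minimal, self-contained PyTorch sketch of one FaST‑PT‑style round. Everything below is an illustrative assumption rather than the authors' implementation: random tensors stand in for the frozen CLIP image/text features, the MST module is a simple FiLM‑style modulation, and each prompt is a single vector added to the embedding. The point is to show what is trained locally, what never leaves the device, and what the server averages.

```python
# Schematic sketch of one federated round in a FaST-PT-like setup.
# All shapes, modules, and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM, NUM_CLASSES, BATCH = 512, 7, 32   # assumed sizes


class MST(nn.Module):
    """Text-guided style perturbation of frozen image embeddings.
    A FiLM-style modulation is assumed here; the paper's exact MST network may differ."""
    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.film = nn.Linear(dim, 2 * dim)   # style text embedding -> (scale, shift)

    def forward(self, img_emb, style_text_emb):
        scale, shift = self.film(style_text_emb).chunk(2, dim=-1)
        return img_emb * (1.0 + scale) + shift


class Client:
    def __init__(self, class_emb):
        self.mst = MST()                                          # tiny module, uploaded each round
        self.domain_prompt = nn.Parameter(torch.zeros(EMB_DIM))   # never leaves the device
        self.class_emb = class_emb                                # frozen stand-in for CLIP text features

    def local_update(self, global_prompt, img_emb, labels, style_text_emb,
                     steps: int = 5, lr: float = 1e-2):
        global_prompt = nn.Parameter(global_prompt.clone())       # local working copy
        opt = torch.optim.SGD(
            [global_prompt, self.domain_prompt, *self.mst.parameters()], lr=lr)
        for _ in range(steps):
            aug = self.mst(img_emb, style_text_emb)               # MST-style augmentation
            feat = aug + global_prompt + self.domain_prompt       # prompt-conditioned feature
            logits = F.normalize(feat, dim=-1) @ F.normalize(self.class_emb, dim=-1).T
            loss = F.cross_entropy(logits / 0.07, labels)         # CLIP-style temperature
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Only the global prompt is returned here; the MST parameters would also
        # be uploaded in practice (omitted for brevity).
        return global_prompt.detach()


# One federated round over a few simulated clients.
class_emb = torch.randn(NUM_CLASSES, EMB_DIM)        # stand-in for text-encoded class names
global_prompt = torch.zeros(EMB_DIM)                 # server-side global prompt
clients = [Client(class_emb) for _ in range(3)]

updates = []
for client in clients:
    img_emb = torch.randn(BATCH, EMB_DIM)            # stand-in for frozen CLIP image features
    labels = torch.randint(0, NUM_CLASSES, (BATCH,))
    style_text_emb = torch.randn(EMB_DIM)            # stand-in for e.g. CLIP("a sketch of a dog")
    updates.append(client.local_update(global_prompt, img_emb, labels, style_text_emb))

global_prompt = torch.stack(updates).mean(dim=0)     # standard federated averaging
print("updated global prompt norm:", round(global_prompt.norm().item(), 4))
```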

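The DPG step (item 3 above) can be pictured as a small gating network that inspects each incoming embedding and decides, per sample, how much of the global versus the domain prompt to apply; this per-sample weighting is what lets the model adapt to unseen domains at inference time. The gate below is an assumed two-layer MLP chosen for illustration, not the paper's exact DPG design.

```python
# Illustrative DPG-style gate for per-sample prompt blending (assumed architecture).
import torch
import torch.nn as nn

EMB_DIM = 512


class DomainAwarePromptGenerator(nn.Module):
    """Predicts, for every sample, how to mix the global and domain prompts."""
    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 2))

    def forward(self, img_emb, global_prompt, domain_prompt):
        weights = self.gate(img_emb).softmax(dim=-1)           # (batch, 2) mixing weights
        blended = weights[:, :1] * global_prompt + weights[:, 1:] * domain_prompt
        return img_emb + blended                               # prompt-conditioned feature


dpg = DomainAwarePromptGenerator()
img_emb = torch.randn(8, EMB_DIM)          # frozen-backbone features from an unseen domain
global_prompt = torch.randn(EMB_DIM)       # shared, server-aggregated prompt
domain_prompt = torch.randn(EMB_DIM)       # this client's local prompt
features = dpg(img_emb, global_prompt, domain_prompt)
print(features.shape)                      # torch.Size([8, 512]) -> fed to the frozen classifier
```
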
Results & Findings

Dataset            Prior SOTA (FedDG‑GA)    FaST‑PT (Ours)    Gain (pp)
PACS               78.3 %                   84.1 %            +5.8
DomainNet (Art)    62.7 %                   69.4 %            +6.7
Office‑Home        71.5 %                   77.2 %            +5.7
VLCS               75.0 %                   80.3 %            +5.3
  • Communication: FaST‑PT needs ~30 % fewer rounds to converge compared with DiPrompt.
  • Compute: The MST module adds <0.5 GFLOPs per client, negligible on modern edge GPUs/NPUs.
  • Ablation: Removing DPG drops accuracy by ~3 %; disabling MST (i.e., no style augmentation) reduces performance by ~4 %, confirming both are essential.

Practical Implications

  • Edge AI Deployments – Companies can train a single vision model across a fleet of smartphones, cameras, or IoT sensors and expect it to hold up in brand‑new environments (e.g., a new store layout) without re‑collecting data.
  • Reduced Bandwidth Costs – Since only compact prompts and embedding‑level style parameters are exchanged, federated updates become viable even on low‑speed networks (a rough payload estimate follows this list).
  • Plug‑and‑Play Compatibility – FaST‑PT works on top of any pre‑trained vision‑language backbone (CLIP, BLIP, etc.), so existing pipelines can adopt it with minimal code changes.
  • Rapid Prototyping – The DPG module can be exposed as an API that dynamically selects prompts based on runtime metadata (device type, GPS, etc.), enabling “smart” inference that adapts on the fly.
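
To give a rough sense of scale for the bandwidth point above, the back‑of‑envelope estimate below compares a prompt‑plus‑MST upload against shipping a full backbone each round. Every number in it (prompt length, MST parameter count, backbone size) is an assumption chosen for illustration, not a figure reported in the paper.

```python
# Back-of-envelope payload comparison; all sizes are illustrative assumptions.
prompt_bytes = 16 * 512 * 4        # assumed 16-token x 512-dim global prompt, fp32 (~32 KB)
mst_bytes = int(0.2e6) * 4         # assumed ~0.2M MST parameters, fp32 (~0.8 MB)
full_model_bytes = int(151e6) * 4  # a full CLIP ViT-B/32-sized backbone, fp32 (~604 MB)

print(f"prompt + MST upload per round: ~{(prompt_bytes + mst_bytes) / 1e6:.1f} MB")
print(f"full-model upload per round:   ~{full_model_bytes / 1e6:.0f} MB")
```

Even under these generous assumptions about the MST module, the per‑round upload stays hundreds of times smaller than a full‑model exchange, which is what makes low‑speed networks practical.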

Limitations & Future Work

  • Text Supervision Dependency – MST relies on well‑crafted textual style cues; noisy or missing captions could degrade augmentation quality.
  • Scalability to Hundreds of Clients – Experiments were capped at roughly 20 clients; the authors note potential challenges in prompt aggregation as client counts grow much larger.
  • Domain Prompt Storage – Each client must retain its own domain prompt, which may become a memory concern on ultra‑constrained devices.

Suggested future directions include:

  1. Automated style‑prompt generation via LLMs.
  2. Hierarchical prompt aggregation for massive client populations.
  3. Extending the approach to non‑visual modalities (audio, sensor data).

Authors

  • Yuliang Chen
  • Xi Lin
  • Jun Wu
  • Xiangrui Cai
  • Qiaolun Zhang
  • Xichun Fan
  • Jiapeng Xu
  • Xiu Su

Paper Information

  • arXiv ID: 2601.05955v1
  • Categories: cs.DC
  • Published: January 9, 2026