[Paper] Multi-Modal Style Transfer-based Prompt Tuning for Efficient Federated Domain Generalization
Source: arXiv - 2601.05955v1
Overview
A new federated learning framework called FaST‑PT tackles the long‑standing problem of domain shift when many edge devices (or “clients”) collaboratively train a model that must work on unseen data sources. By marrying lightweight multi‑modal style transfer with a clever prompt‑tuning scheme, the authors dramatically cut communication costs while still achieving state‑of‑the‑art generalization across domains.
Key Contributions
- Multi‑Modal Style Transfer (MST) – a tiny, text‑guided image‑embedding augmentation that expands the effective training distribution without transmitting extra images.
- Dual‑Prompt Architecture – separates prompts into a global component (learned from all clients) and a domain component (capturing client‑specific quirks).
- Domain‑aware Prompt Generation (DPG) – a runtime module that selects the right mix of global and domain prompts per sample, enabling on‑the‑fly adaptation to new, unseen domains.
- Efficiency Gains – the whole pipeline converges in far fewer communication rounds and with a smaller compute footprint than prior FDG methods (e.g., FedDG‑GA, DiPrompt).
- Extensive Validation – experiments on four cross‑domain benchmarks (PACS, DomainNet, Office‑Home, VLCS) show consistent accuracy improvements, and ablation studies confirm each design choice.
Methodology
- Local Feature Augmentation via MST (a client‑side sketch follows this list)
- Each client extracts image embeddings from a frozen vision‑language backbone (e.g., CLIP).
- A lightweight style‑transfer network, conditioned on textual descriptions (e.g., “photo”, “sketch”), perturbs these embeddings to mimic the visual style of other domains.
- Because only embeddings (not raw pixels) are exchanged, bandwidth usage stays minimal.
- Prompt Decomposition
- Global Prompt: learned centrally from the aggregated, style‑augmented embeddings; encodes knowledge that should hold across any domain.
- Domain Prompt: kept locally; captures nuances of the client’s own data distribution (camera type, lighting, etc.).
- Domain‑aware Prompt Generation (DPG)
- For each incoming sample, DPG predicts a weighting vector that blends the global and domain prompts.
- The blended prompt is then injected into the downstream classifier (or decoder), effectively “personalizing” the inference step without extra model parameters.
- Training Loop (a server‑side aggregation sketch follows this list)
- Clients perform a few local SGD steps on their augmented embeddings and domain prompts.
- Only the global prompt and a tiny MST parameter set are uploaded to the server each round.
- The server averages the global prompts (standard federated averaging) and redistributes the updated version.
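To make the client‑side steps above concrete, here is a minimal, hypothetical PyTorch sketch of the two local components: an MST‑style module that perturbs a frozen image embedding toward a text‑described style, and a DPG‑style gate that blends the shared global prompt with the locally kept domain prompt per sample. The module names, dimensions, and the exact mixing rules are illustrative assumptions; the paper's precise formulations are not spelled out in this summary.

```python
# Hypothetical client-side sketch of FaST-PT's local components (not the authors' code).
# Assumptions: embeddings come from a frozen CLIP-like backbone; MST is modeled as a small
# network that adds a bounded, style-conditioned residual to an image embedding; DPG is a
# gating network that produces a per-sample convex blend of global and domain prompts.
import torch
import torch.nn as nn

EMB_DIM, PROMPT_LEN = 512, 8  # illustrative sizes


class StyleTransferAug(nn.Module):
    """MST stand-in: shift an image embedding toward a text-described target style."""

    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, img_emb: torch.Tensor, style_text_emb: torch.Tensor) -> torch.Tensor:
        delta = self.mix(torch.cat([img_emb, style_text_emb], dim=-1))  # bounded residual
        return img_emb + delta


class DomainAwarePromptGen(nn.Module):
    """DPG stand-in: per-sample blend of a shared global prompt and a local domain prompt."""

    def __init__(self, dim: int = EMB_DIM, prompt_len: int = PROMPT_LEN):
        super().__init__()
        self.global_prompt = nn.Parameter(0.02 * torch.randn(prompt_len, dim))  # synced with server
        self.domain_prompt = nn.Parameter(0.02 * torch.randn(prompt_len, dim))  # never leaves client
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, img_emb: torch.Tensor) -> torch.Tensor:
        w = self.gate(img_emb).unsqueeze(-1)                           # (batch, 1, 1) blend weight
        return w * self.global_prompt + (1 - w) * self.domain_prompt   # (batch, prompt_len, dim)


if __name__ == "__main__":
    # Random tensors stand in for frozen CLIP image/text embeddings.
    img_emb = torch.randn(4, EMB_DIM)
    sketch_style = torch.randn(EMB_DIM).expand(4, EMB_DIM)  # e.g. text("a sketch of an object")
    augmented = StyleTransferAug()(img_emb, sketch_style)
    prompts = DomainAwarePromptGen()(augmented)
    print(augmented.shape, prompts.shape)                    # [4, 512] and [4, 8, 512]
```

The blended prompt would then be injected into the downstream classifier, as described in the DPG step above.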
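The training loop reads as standard federated averaging restricted to the small uploaded tensors. The sketch below shows one plausible server round under that reading: it averages the clients' global prompts (and MST parameters), weighted by local sample counts, before broadcasting the result. The weighting scheme and tensor names are assumptions for illustration.

```python
# Hypothetical server-side aggregation round for FaST-PT (illustrative, not the authors' code).
# Per the summary, each client uploads only its global prompt and a tiny MST parameter set;
# the server averages them FedAvg-style and redistributes the result.
from typing import Dict, List, Tuple

import torch


def aggregate_round(
    client_updates: List[Tuple[int, Dict[str, torch.Tensor]]],
) -> Dict[str, torch.Tensor]:
    """Average uploaded tensors, weighting each client by its local sample count."""
    total = sum(n_samples for n_samples, _ in client_updates)
    keys = client_updates[0][1].keys()
    return {
        key: sum((n_samples / total) * update[key] for n_samples, update in client_updates)
        for key in keys
    }


if __name__ == "__main__":
    # Three fake clients with different dataset sizes upload a tiny "global prompt" tensor.
    uploads = [
        (n, {"global_prompt": torch.full((2, 4), float(value))})
        for value, n in enumerate([100, 50, 50])
    ]
    merged = aggregate_round(uploads)
    print(merged["global_prompt"][0])  # weighted mean: 0*0.5 + 1*0.25 + 2*0.25 = 0.75
```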
Results & Findings
| Dataset | Prior SOTA (FedDG‑GA) | FaST‑PT (Ours) | Gain (pp) |
|---|---|---|---|
| PACS | 78.3 % | 84.1 % | +5.8 |
| DomainNet (Art) | 62.7 % | 69.4 % | +6.7 |
| Office‑Home | 71.5 % | 77.2 % | +5.7 |
| VLCS | 75.0 % | 80.3 % | +5.3 |
- Communication: FaST‑PT needs ~30 % fewer rounds to converge compared with DiPrompt.
- Compute: The MST module adds <0.5 GFLOPs per client, negligible on modern edge GPUs/NPUs.
- Ablation: Removing DPG drops accuracy by ~3 %; disabling MST (i.e., no style augmentation) reduces performance by ~4 %, confirming both are essential.
Practical Implications
- Edge AI Deployments – Companies can train a single vision model across a fleet of smartphones, cameras, or IoT sensors that stays robust in brand‑new environments (e.g., a new store layout) without re‑collecting data.
- Reduced Bandwidth Costs – Since only compact prompts and embedding‑level style parameters are exchanged, federated updates become viable even on low‑speed networks.
- Plug‑and‑Play Compatibility – FaST‑PT works on top of any pre‑trained vision‑language backbone (CLIP, BLIP, etc.), so existing pipelines can adopt it with minimal code changes.
- Rapid Prototyping – The DPG module can be exposed as an API that dynamically selects prompts based on runtime metadata (device type, GPS, etc.), enabling “smart” inference that adapts on the fly.
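As a concrete illustration of the last point, a thin, hypothetical wrapper around a DPG‑style module could look like the sketch below: request metadata decides how much weight the blended prompt gives to the shared global prompt versus the locally stored domain prompt. The metadata fields, the blending rule, and all names are assumptions, not an API described in the paper.

```python
# Hypothetical inference-time wrapper around a DPG-style prompt blend (not from the paper).
# Runtime metadata (device type, coarse location tag, ...) biases the blend: familiar
# domains lean on the local domain prompt, unseen ones lean on the shared global prompt.
from dataclasses import dataclass
from typing import Optional, Set

import torch


@dataclass
class RequestMeta:
    device_type: str                     # e.g. "phone-cam", "ip-camera"
    location_tag: Optional[str] = None   # e.g. a store ID or coarse GPS cell


def global_prompt_weight(meta: RequestMeta, known_domains: Set[str]) -> float:
    """Illustrative rule: trust the domain prompt only for locations seen during training."""
    is_known = meta.location_tag in known_domains if meta.location_tag else False
    return 0.3 if is_known else 0.9


def blended_prompt(w: float, global_p: torch.Tensor, domain_p: torch.Tensor) -> torch.Tensor:
    return w * global_p + (1.0 - w) * domain_p


if __name__ == "__main__":
    g, d = torch.zeros(8, 512), torch.ones(8, 512)
    meta = RequestMeta(device_type="ip-camera", location_tag="store-042")
    w = global_prompt_weight(meta, known_domains={"store-001"})
    print(w, blended_prompt(w, g, d).mean().item())  # unseen store -> w = 0.9, lean on global prompt
```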
Limitations & Future Work
- Text Supervision Dependency – MST relies on well‑crafted textual style cues; noisy or missing captions could degrade augmentation quality.
- Scalability to Hundreds of Clients – Experiments were capped at ~20 clients; the authors note potential challenges in prompt aggregation as client counts grow much larger.
- Domain Prompt Storage – Each client must retain its own domain prompt, which may become a memory concern on ultra‑constrained devices.
Suggested future directions include:
- Automated style‑prompt generation via LLMs.
- Hierarchical prompt aggregation for massive client populations.
- Extending the approach to non‑visual modalities (audio, sensor data).
Authors
- Yuliang Chen
- Xi Lin
- Jun Wu
- Xiangrui Cai
- Qiaolun Zhang
- Xichun Fan
- Jiapeng Xu
- Xiu Su
Paper Information
- arXiv ID: 2601.05955v1
- Categories: cs.DC
- Published: January 9, 2026