[Paper] Multi-Modal Style Transfer-based Prompt Tuning for Efficient Federated Domain Generalization

Published: January 9, 2026 at 12:14 PM EST
3 min read
Source: arXiv - 2601.05955v1

Overview

A new federated learning framework called FaST‑PT tackles the long‑standing problem of domain shift when many edge devices (or “clients”) collaboratively train a model that must work on unseen data sources. By marrying lightweight multi‑modal style transfer with a clever prompt‑tuning scheme, the authors dramatically cut communication costs while still achieving state‑of‑the‑art generalization across domains.

Key Contributions

  • Multi‑Modal Style Transfer (MST) – a tiny, text‑guided image‑embedding augmentation that expands the effective training distribution without transmitting extra images.
  • Dual‑Prompt Architecture – separates prompts into a global component (learned from all clients) and a domain component (capturing client‑specific quirks).
  • Domain‑aware Prompt Generation (DPG) – a runtime module that selects the right mix of global and domain prompts per sample, enabling on‑the‑fly adaptation to new, unseen domains.
  • Efficiency Gains – the whole pipeline converges in far fewer communication rounds and with a lower compute footprint than prior FDG methods (e.g., FedDG‑GA, DiPrompt).
  • Extensive Validation – experiments on four cross‑domain benchmarks (PACS, Office‑Home, VLCS, and DomainNet) show consistent accuracy improvements, and ablation studies confirm each design choice.

Methodology

  1. Local Feature Augmentation via MST

    • Each client extracts image embeddings from a frozen vision‑language backbone (e.g., CLIP).
    • A lightweight style‑transfer network, conditioned on textual descriptions (e.g., “photo”, “sketch”), perturbs these embeddings to mimic the visual style of other domains.
    • Because only embeddings (not raw pixels) are exchanged, bandwidth usage stays minimal.
  2. Prompt Decomposition

    • Global Prompt: learned centrally from the aggregated, style‑augmented embeddings; encodes knowledge that should hold across any domain.
    • Domain Prompt: kept locally; captures nuances of the client’s own data distribution (camera type, lighting, etc.).
  3. Domain‑aware Prompt Generation (DPG)

    • For each incoming sample, DPG predicts a weighting vector that blends the global and domain prompts.
    • The blended prompt is then injected into the downstream classifier (or decoder), effectively “personalizing” the inference step without extra model parameters (a minimal sketch of this blending follows the list).
  4. Training Loop

    • Clients perform a few local SGD steps on their augmented embeddings and domain prompts.
    • Only the global prompt and a tiny MST parameter set are uploaded to the server each round.
    • The server averages the global prompts (standard federated averaging) and redistributes the updated version (a schematic sketch of one training round also follows this list).
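
To make the training loop concrete, here is a minimal, self-contained PyTorch sketch of one FaST‑PT‑style round. Everything below is an illustrative assumption rather than the authors' implementation: random tensors stand in for the frozen CLIP image/text features, the MST module is a simple FiLM‑style modulation, and each prompt is a single vector added to the embedding. The point is to show what is trained locally, what never leaves the device, and what the server averages.

```python
# Schematic sketch of one federated round in a FaST-PT-like setup.
# All shapes, modules, and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM, NUM_CLASSES, BATCH = 512, 7, 32   # assumed sizes


class MST(nn.Module):
    """Text-guided style perturbation of frozen image embeddings.
    A FiLM-style modulation is assumed here; the paper's exact MST network may differ."""
    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.film = nn.Linear(dim, 2 * dim)   # style text embedding -> (scale, shift)

    def forward(self, img_emb, style_text_emb):
        scale, shift = self.film(style_text_emb).chunk(2, dim=-1)
        return img_emb * (1.0 + scale) + shift


class Client:
    def __init__(self, class_emb):
        self.mst = MST()                                          # tiny module, uploaded each round
        self.domain_prompt = nn.Parameter(torch.zeros(EMB_DIM))   # never leaves the device
        self.class_emb = class_emb                                # frozen stand-in for CLIP text features

    def local_update(self, global_prompt, img_emb, labels, style_text_emb,
                     steps: int = 5, lr: float = 1e-2):
        global_prompt = nn.Parameter(global_prompt.clone())       # local working copy
        opt = torch.optim.SGD(
            [global_prompt, self.domain_prompt, *self.mst.parameters()], lr=lr)
        for _ in range(steps):
            aug = self.mst(img_emb, style_text_emb)               # MST-style augmentation
            feat = aug + global_prompt + self.domain_prompt       # prompt-conditioned feature
            logits = F.normalize(feat, dim=-1) @ F.normalize(self.class_emb, dim=-1).T
            loss = F.cross_entropy(logits / 0.07, labels)         # CLIP-style temperature
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Only the global prompt is returned here; the MST parameters would also
        # be uploaded in practice (omitted for brevity).
        return global_prompt.detach()


# One federated round over a few simulated clients.
class_emb = torch.randn(NUM_CLASSES, EMB_DIM)        # stand-in for text-encoded class names
global_prompt = torch.zeros(EMB_DIM)                 # server-side global prompt
clients = [Client(class_emb) for _ in range(3)]

updates = []
for client in clients:
    img_emb = torch.randn(BATCH, EMB_DIM)            # stand-in for frozen CLIP image features
    labels = torch.randint(0, NUM_CLASSES, (BATCH,))
    style_text_emb = torch.randn(EMB_DIM)            # stand-in for e.g. CLIP("a sketch of a dog")
    updates.append(client.local_update(global_prompt, img_emb, labels, style_text_emb))

global_prompt = torch.stack(updates).mean(dim=0)     # standard federated averaging
print("updated global prompt norm:", round(global_prompt.norm().item(), 4))
```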

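The DPG step (item 3 above) can be pictured as a small gating network that inspects each incoming embedding and decides, per sample, how much of the global versus the domain prompt to apply; this per-sample weighting is what lets the model adapt to unseen domains at inference time. The gate below is an assumed two-layer MLP chosen for illustration, not the paper's exact DPG design.

```python
# Illustrative DPG-style gate for per-sample prompt blending (assumed architecture).
import torch
import torch.nn as nn

EMB_DIM = 512


class DomainAwarePromptGenerator(nn.Module):
    """Predicts, for every sample, how to mix the global and domain prompts."""
    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 2))

    def forward(self, img_emb, global_prompt, domain_prompt):
        weights = self.gate(img_emb).softmax(dim=-1)           # (batch, 2) mixing weights
        blended = weights[:, :1] * global_prompt + weights[:, 1:] * domain_prompt
        return img_emb + blended                               # prompt-conditioned feature


dpg = DomainAwarePromptGenerator()
img_emb = torch.randn(8, EMB_DIM)          # frozen-backbone features from an unseen domain
global_prompt = torch.randn(EMB_DIM)       # shared, server-aggregated prompt
domain_prompt = torch.randn(EMB_DIM)       # this client's local prompt
features = dpg(img_emb, global_prompt, domain_prompt)
print(features.shape)                      # torch.Size([8, 512]) -> fed to the frozen classifier
```
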
Results & Findings

Dataset            Prior SOTA (FedDG‑GA)    FaST‑PT (Ours)    Gain (pp)
PACS               78.3 %                   84.1 %            +5.8
DomainNet (Art)    62.7 %                   69.4 %            +6.7
Office‑Home        71.5 %                   77.2 %            +5.7
VLCS               75.0 %                   80.3 %            +5.3
  • Communication: FaST‑PT needs ~30 % fewer rounds to converge compared with DiPrompt.
  • Compute: The MST module adds <0.5 GFLOPs per client, negligible on modern edge GPUs/NPUs.
  • Ablation: Removing DPG drops accuracy by ~3 %; disabling MST (i.e., no style augmentation) reduces performance by ~4 %, confirming both are essential.

Practical Implications

  • Edge AI Deployments – Companies can train a single vision model across a fleet of smartphones, cameras, or IoT sensors and expect it to hold up in brand‑new environments (e.g., a new store layout) without re‑collecting data.
  • Reduced Bandwidth Costs – Since only compact prompts and embedding‑level style parameters are exchanged, federated updates become viable even on low‑speed networks (a rough payload estimate follows this list).
  • Plug‑and‑Play Compatibility – FaST‑PT works on top of any pre‑trained vision‑language backbone (CLIP, BLIP, etc.), so existing pipelines can adopt it with minimal code changes.
  • Rapid Prototyping – The DPG module can be exposed as an API that dynamically selects prompts based on runtime metadata (device type, GPS, etc.), enabling “smart” inference that adapts on the fly.
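
To give a rough sense of scale for the bandwidth point above, the back‑of‑envelope estimate below compares a prompt‑plus‑MST upload against shipping a full backbone each round. Every number in it (prompt length, MST parameter count, backbone size) is an assumption chosen for illustration, not a figure reported in the paper.

```python
# Back-of-envelope payload comparison; all sizes are illustrative assumptions.
prompt_bytes = 16 * 512 * 4        # assumed 16-token x 512-dim global prompt, fp32 (~32 KB)
mst_bytes = int(0.2e6) * 4         # assumed ~0.2M MST parameters, fp32 (~0.8 MB)
full_model_bytes = int(151e6) * 4  # a full CLIP ViT-B/32-sized backbone, fp32 (~604 MB)

print(f"prompt + MST upload per round: ~{(prompt_bytes + mst_bytes) / 1e6:.1f} MB")
print(f"full-model upload per round:   ~{full_model_bytes / 1e6:.0f} MB")
```

Even under these generous assumptions about the MST module, the per‑round upload stays hundreds of times smaller than a full‑model exchange, which is what makes low‑speed networks practical.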

Limitations & Future Work

  • Text Supervision Dependency – MST relies on well‑crafted textual style cues; noisy or missing captions could degrade augmentation quality.
  • Scalability to Hundreds of Clients – Experiments were capped at roughly 20 clients; the authors note potential challenges in prompt aggregation as client counts grow much larger.
  • Domain Prompt Storage – Each client must retain its own domain prompt, which may become a memory concern on ultra‑constrained devices.

Suggested future directions include:

  1. Automated style‑prompt generation via LLMs.
  2. Hierarchical prompt aggregation for massive client populations.
  3. Extending the approach to non‑visual modalities (audio, sensor data).

Authors

  • Yuliang Chen
  • Xi Lin
  • Jun Wu
  • Xiangrui Cai
  • Qiaolun Zhang
  • Xichun Fan
  • Jiapeng Xu
  • Xiu Su

Paper Information

  • arXiv ID: 2601.05955v1
  • Categories: cs.DC
  • Published: January 9, 2026