[Paper] MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing
Source: arXiv - 2603.02885v1
Overview
Fine‑tuning large language models (LLMs) for many customers is a core service in modern AI datacenters. The usual approach—running a separate PEFT (parameter‑efficient fine‑tuning) instance for each request—leaves GPUs half‑empty and creates costly stalls when tasks contend for compute and communication bandwidth. MuxTune proposes a system that shares the LLM backbone across many fine‑tuning jobs, multiplexing it in both space (parallel operator execution) and time (task interleaving) to squeeze out higher utilization and dramatically lower memory footprints.
Key Contributions
- Unified PEFT representation that abstracts diverse fine‑tuning methods (e.g., LoRA, adapters, prefix‑tuning) into a common backbone‑sharing format.
- Hierarchical co‑scheduling across three levels (task, operator, data) that decides when and where each PEFT task runs.
- Hybrid spatial‑temporal multiplexing: tasks are fused so that different layers of the backbone can serve multiple fine‑tuning streams simultaneously, while still preserving each task’s logical order.
- Two‑tiered hybrid parallelism that blends data‑parallel and pipeline‑parallel execution for the shared backbone, reducing idle GPU cycles.
- Chunk‑based data alignment that groups tokens from different tasks into “effective” chunks, eliminating wasted computation on padding or task‑specific tokens.
- Empirical gains: up to 2.33× higher throughput and 5.29× lower memory usage versus three leading PEFT serving baselines.
Methodology
Modular Backbone Abstraction
- The LLM’s transformer layers are treated as a shared service that can be invoked by any PEFT task.
- Each task’s lightweight adapters are attached as plug‑in modules, allowing the same core weights to be reused without duplication.
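The plug-in idea above can be sketched in a few lines. This is a hypothetical illustration, not MuxTune's actual API: one frozen backbone weight matrix is stored once and shared, while each task contributes only a small low-rank (LoRA-style) delta attached by task ID.

```python
# Illustrative sketch: a shared, frozen backbone layer with per-task
# LoRA-style adapters. All class and method names here are invented.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class SharedBackboneLayer:
    def __init__(self, W):
        self.W = W            # frozen backbone weights, stored once
        self.adapters = {}    # task_id -> (A, B) low-rank factors

    def attach(self, task_id, A, B):
        self.adapters[task_id] = (A, B)

    def forward(self, task_id, x):
        y = matvec(self.W, x)                # shared backbone compute
        A, B = self.adapters[task_id]
        delta = matvec(B, matvec(A, x))      # tiny per-task update
        return [yi + di for yi, di in zip(y, delta)]

layer = SharedBackboneLayer([[1.0, 0.0], [0.0, 1.0]])     # identity backbone
layer.attach("task_a", A=[[1.0, 0.0]], B=[[0.5], [0.0]])  # rank-1 delta
layer.attach("task_b", A=[[0.0, 1.0]], B=[[0.0], [2.0]])

print(layer.forward("task_a", [1.0, 1.0]))  # [1.5, 1.0]
print(layer.forward("task_b", [1.0, 1.0]))  # [1.0, 3.0]
```

Both tasks reuse the same `W`; only the `(A, B)` factors are per-task state, which is why memory grows with adapter size rather than model size.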
Hierarchical Co‑Scheduling
- Task‑level: a global scheduler groups compatible fine‑tuning jobs (similar batch sizes, token lengths) into a fusion group.
- Operator‑level: within a group, the scheduler decides which transformer sub‑layer (e.g., attention, feed‑forward) runs in parallel across tasks (spatial multiplexing) and which runs sequentially (temporal multiplexing).
- Data‑level: input sequences are sliced into chunks that align across tasks, so that a single GPU kernel processes a mixed batch of tokens from several jobs at once.
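The task-level and data-level steps can be sketched as follows. This is a simplified stand-in for the paper's scheduler: jobs whose batch size and bucketed sequence length match are fused, and their token streams are packed into fixed-size mixed-task chunks. The function names and the bucketing rule are assumptions, not MuxTune's interfaces.

```python
# Hypothetical sketch of fusion grouping and chunk-based data alignment.
from collections import defaultdict

def fusion_groups(jobs, len_bucket=128):
    """Group jobs by (batch size, sequence-length bucket)."""
    groups = defaultdict(list)
    for job in jobs:
        key = (job["batch"], job["seq_len"] // len_bucket)
        groups[key].append(job["id"])
    return dict(groups)

def pack_chunks(token_streams, chunk_size):
    """Interleave tokens (tagged with their task id) into aligned chunks,
    so one kernel launch can process tokens from several jobs at once."""
    tagged = [(tid, tok) for tid, toks in token_streams.items() for tok in toks]
    return [tagged[i:i + chunk_size] for i in range(0, len(tagged), chunk_size)]

jobs = [
    {"id": "j1", "batch": 4, "seq_len": 120},
    {"id": "j2", "batch": 4, "seq_len": 100},
    {"id": "j3", "batch": 8, "seq_len": 500},
]
print(fusion_groups(jobs))  # j1 and j2 fuse; j3 stands alone
chunks = pack_chunks({"j1": [0, 1, 2], "j2": [10, 11]}, chunk_size=2)
print(chunks)               # mixed-task chunks, no padding tokens
```

Packing by chunk rather than by per-task batch is what avoids padding waste: every slot in a chunk holds a real token from some job.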
Two‑Tier Hybrid Parallelism
- Tier 1 (intra‑task): classic data‑parallelism for each task’s adapters, keeping gradient updates local.
- Tier 2 (inter‑task): pipeline‑parallelism across the shared backbone, allowing the next task’s chunk to start while the previous one finishes a later layer.
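The Tier-2 interleaving can be illustrated with a toy schedule generator (not the paper's algorithm): with `S` backbone stages, chunk `c` occupies stage `s` at step `c + s`, so the next task's chunk enters stage 0 while the previous one is still in a later stage.

```python
# Illustrative pipeline-interleaving sketch; names are invented.
from collections import defaultdict

def pipeline_schedule(chunks, num_stages):
    """Return, for each time step, a map of stage -> chunk occupying it."""
    steps = defaultdict(dict)
    for c, chunk_id in enumerate(chunks):
        for s in range(num_stages):
            steps[c + s][s] = chunk_id   # chunk c reaches stage s at step c+s
    return [steps[t] for t in sorted(steps)]

# Two chunks of task A followed by one chunk of task B, 3 backbone stages.
sched = pipeline_schedule(["A0", "A1", "B0"], num_stages=3)
for t, busy in enumerate(sched):
    print(t, busy)
# At step 1, stage 0 already runs A1 while stage 1 runs A0; at step 2,
# task B enters stage 0 with no idle bubble between tasks.
```

The point of the sketch is the overlap: once the pipeline is full, every stage is busy with some task's chunk at every step.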
Implementation Details
- Built on top of PyTorch and NVIDIA’s NCCL for fast inter‑GPU communication.
- Custom CUDA kernels handle the mixed‑token chunks, avoiding the overhead of launching separate kernels per task.
- A lightweight runtime monitors GPU memory pressure and dynamically reshapes fusion groups to stay within memory limits.
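A minimal sketch of that runtime policy, under invented assumptions: memory per group is estimated with a toy cost model (tokens × bytes-per-token), and jobs are deferred from the largest group until the estimate fits the budget. Neither the cost model nor the shedding rule is from the paper.

```python
# Hedged sketch of memory-pressure-driven fusion-group reshaping.

def estimate_mb(group, bytes_per_token=4096):
    """Toy cost model: activation memory proportional to token count."""
    return sum(j["tokens"] for j in group) * bytes_per_token / 1e6

def shed_to_budget(groups, budget_mb):
    """Defer jobs from the largest group until the estimate fits the budget."""
    groups = [list(g) for g in groups]
    deferred = []
    while sum(estimate_mb(g) for g in groups) > budget_mb:
        big = max(groups, key=estimate_mb)
        if not big:
            break
        deferred.append(big.pop())   # defer the last-added job in that group
    return [g for g in groups if g], deferred

groups = [
    [{"id": "a", "tokens": 2000}, {"id": "b", "tokens": 2000}],
    [{"id": "c", "tokens": 1000}],
]
active, deferred = shed_to_budget(groups, budget_mb=16.0)
print([[j["id"] for j in g] for g in active])   # groups kept running
print([j["id"] for j in deferred])              # jobs pushed to a wait queue
```

In a real runtime the deferred jobs would re-enter scheduling once pressure drops; here they simply accumulate in a list to keep the sketch short.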
Results & Findings
| Baseline | Throughput (samples/s) | GPU Memory (GB) |
|---|---|---|
| Single‑Task PEFT (state‑of‑the‑art) | 1.0× (reference) | 12 |
| Parallel‑Task Naïve (no sharing) | 0.78× | 18 |
| Existing Multi‑Task PEFT System | 1.45× | 9 |
| MuxTune | 2.33× | 2.3 (≈5.29× reduction) |
- Throughput scales almost linearly with the number of concurrent tasks up to the point where the shared backbone becomes the bottleneck; beyond that, the scheduler automatically throttles new tasks.
- Memory savings come primarily from storing a single copy of the backbone weights and reusing them across tasks; adapters remain the only per‑task overhead.
- Latency impact is modest: the hybrid temporal multiplexing adds ≤ 15 ms per batch, which is negligible for most fine‑tuning API workloads.
- Scalability tests on 8‑GPU clusters show consistent gains, confirming that the approach works across both single‑node and multi‑node deployments.
Practical Implications
- Cost Reduction for AI Service Providers – By cutting memory usage > 5×, providers can fit more fine‑tuning jobs on the same GPU fleet, lowering hardware spend and energy consumption.
- Higher SLA Fulfilment – The throughput boost means lower request queuing times, translating to tighter latency SLAs for customers who need rapid model customization.
- Simplified Ops – Operators no longer need to spin up a dedicated container per fine‑tuning request; a single MuxTune service can host dozens of concurrent jobs, easing orchestration and monitoring.
- Developer Flexibility – Since MuxTune works with any PEFT method that can be expressed in the unified representation, developers can continue using their preferred adapters without code changes.
- Potential for Edge‑to‑Cloud Continuity – The same multiplexing ideas could be applied to smaller GPU clusters at the edge, enabling on‑prem fine‑tuning with the same efficiency gains.
Limitations & Future Work
- Task Compatibility Constraints – Fusion groups require similar sequence lengths and batch sizes; highly heterogeneous workloads may still need separate instances.
- Scheduler Overhead – The hierarchical scheduler introduces some CPU overhead, which could become noticeable at extreme scale (hundreds of concurrent tasks).
- Model Size Bound – Experiments focused on 7‑B to 13‑B parameter models; scaling to 70‑B‑plus models may need additional memory‑aware partitioning strategies.
- Future Directions – The authors plan to (1) extend the unified PEFT abstraction to include retrieval‑augmented fine‑tuning, (2) integrate reinforcement‑learning‑based scheduling for dynamic workloads, and (3) explore hardware‑level support (e.g., NVIDIA Hopper’s tensor‑core scheduling) to further reduce kernel launch latency.
Authors
- Chunyu Xue
- Yi Pan
- Weihao Cui
- Quan Chen
- Shulai Zhang
- Bingsheng He
- Minyi Guo
Paper Information
- arXiv ID: 2603.02885v1
- Categories: cs.DC
- Published: March 3, 2026