[Paper] Trident: Adaptive Scheduling for Heterogeneous Multimodal Data Pipelines
Source: arXiv - 2603.02075v1
Overview
Multimodal AI pipelines—think PDF‑to‑text extraction, video captioning, or image‑plus‑text search—mix heavy CPU preprocessing with GPU/TPU inference. Because the workload constantly shifts (different input sizes, variable‑length model inputs, occasional memory spikes), static schedulers either waste resources or crash with out‑of‑memory (OOM) errors. Trident is an adaptive scheduling framework that watches the pipeline in real time, predicts how fast each operator can run, and continuously re‑optimizes placement and parallelism on a fixed cluster. The result is up to 2× higher throughput without adding hardware.
Key Contributions
- Three‑layer closed‑loop scheduler that (1) observes per‑operator throughput with Gaussian‑Process regression, (2) detects workload regime changes and runs memory‑aware Bayesian optimization, and (3) solves a mixed‑integer linear program to jointly decide parallelism, device placement, and safe configuration transitions.
- Anomaly‑filtered GP model that can handle the noisy, bursty performance signals typical of asynchronous, heterogeneous operators.
- Memory‑constrained Bayesian optimizer that guarantees any suggested configuration stays OOM‑safe, even when the pipeline’s memory footprint spikes.
- Rolling‑update scheduling that accounts for cold‑start costs, enabling smooth transitions without halting the whole pipeline.
- Integration with Ray Data and demonstration on real‑world document‑ and video‑curation pipelines, showing up to 2.01× and 1.88× throughput gains respectively.
Methodology
- Observation Layer – Each operator (e.g., PDF parsing, OCR, video decoding, transformer inference) reports its current throughput and memory usage. Trident fits a Gaussian Process (GP) model to these measurements, automatically discarding outliers caused by transient spikes. The GP predicts the sustainable throughput for any parallelism level.
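As a rough illustration of the observation layer, the sketch below fits a minimal NumPy-only GP (RBF kernel, fixed hyperparameters, no tuning) to noisy per-operator throughput samples, after discarding transient spikes with a median-absolute-deviation rule. The function names, numbers, and outlier threshold are all illustrative assumptions, not details from the paper:

```python
import numpy as np

def filter_outliers(x, y, thresh=3.0):
    # Discard samples whose throughput deviates strongly from the median
    # (robust MAD rule), mimicking the anomaly-filtered GP input.
    med = np.median(y)
    mad = np.median(np.abs(y - med)) + 1e-9
    keep = np.abs(y - med) / (1.4826 * mad) < thresh
    return x[keep], y[keep]

def gp_predict(x_train, y_train, x_query, length=2.0, noise=0.1):
    # Standard GP posterior mean with an RBF kernel and fixed hyperparameters.
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)
    return k(x_query, x_train) @ alpha

# Noisy throughput samples (items/s) at various parallelism levels,
# with one transient spike at parallelism 4.
par = np.array([1, 2, 3, 4, 4, 5, 6], dtype=float)
thr = np.array([100, 190, 270, 340, 900, 400, 450], dtype=float)

par_f, thr_f = filter_outliers(par, thr)          # drops the 900 spike
pred = gp_predict(par_f, thr_f, np.array([4.0]))  # sustainable rate at parallelism 4
```

In practice the model would also track memory usage and predictive uncertainty; this sketch keeps only the posterior mean over throughput.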
- Adaptation Layer – A lightweight change‑point detector watches the GP predictions. When a shift is detected (e.g., a batch of longer PDFs arrives), Trident launches a Bayesian optimization loop that searches the space of parallelism and device‑placement settings subject to a hard memory budget. The optimizer only returns configurations that the GP predicts will stay within memory limits.
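A drastically simplified view of the memory-aware search: instead of a full Bayesian optimization loop, the sketch below filters candidate configurations through a surrogate memory model before scoring them, so an OOM-unsafe configuration is never proposed. The surrogate formulas, memory budget, and search grid are invented for illustration and are not from the paper:

```python
import itertools

MEM_BUDGET_GB = 40.0  # illustrative hard per-node memory budget

def predicted_throughput(parallelism, batch):
    # Stand-in for the GP surrogate: throughput grows sublinearly
    # with parallelism and batch size.
    return 100.0 * parallelism ** 0.8 * batch ** 0.5

def predicted_memory(parallelism, batch):
    # Stand-in memory model: each replica holds the model weights
    # plus per-batch buffers.
    return parallelism * (2.0 + 0.5 * batch)

def safe_argmax(par_levels, batch_sizes):
    # Return the best configuration whose *predicted* memory stays under
    # budget; unsafe candidates are skipped, never proposed.
    best, best_thr = None, -1.0
    for p, b in itertools.product(par_levels, batch_sizes):
        if predicted_memory(p, b) > MEM_BUDGET_GB:
            continue
        thr = predicted_throughput(p, b)
        if thr > best_thr:
            best, best_thr = (p, b), thr
    return best, best_thr

config, thr = safe_argmax(range(1, 9), [1, 2, 4, 8, 16])
```

A real acquisition function would also trade off exploration against exploitation; the hard memory filter is the part this sketch is meant to show.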
- Scheduling Layer – The chosen configuration is fed into a mixed‑integer linear program (MILP). The MILP simultaneously decides:
- How many replicas of each operator to run (parallelism).
- Which hardware (CPU, GPU, NPU, TPU) each replica should occupy.
- When to roll the new configuration in, balancing the cost of cold‑starts (model loading, data warm‑up) against the expected throughput gain.
The MILP respects cluster‑wide constraints such as total GPU memory, PCIe bandwidth, and CPU core counts.
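The scheduling-layer decision can be sketched as a tiny MILP. The toy model below (it needs SciPy ≥ 1.9 for `scipy.optimize.milp`) maximizes end-to-end pipeline throughput `t`, which is bounded by the slowest of two stages, subject to CPU-core and GPU-count limits. The per-replica rates and resource numbers are made up for illustration; the paper's actual MILP also covers device placement and transition timing:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint

# Decision variables: [t, n_parse, n_ocr]
#   t       = end-to-end pipeline throughput (continuous, docs/s)
#   n_parse = CPU parsing replicas (integer, 4 cores each, 32 cores total)
#   n_ocr   = GPU OCR replicas (integer, 1 GPU each, 4 GPUs total)
c = np.array([-1.0, 0.0, 0.0])  # milp minimizes, so negate t to maximize it

A = np.array([
    [1.0, -50.0,    0.0],   # t <= 50 * n_parse   (parse stage rate)
    [1.0,   0.0, -120.0],   # t <= 120 * n_ocr    (OCR stage rate)
    [0.0,   4.0,    0.0],   # 4 * n_parse <= 32   (CPU cores)
    [0.0,   0.0,    1.0],   # n_ocr <= 4          (GPUs)
])
ub = np.array([0.0, 0.0, 32.0, 4.0])
constraints = LinearConstraint(A, -np.inf, ub)

integrality = np.array([0, 1, 1])  # t continuous, replica counts integer
res = milp(c, constraints=constraints, integrality=integrality)
t_opt, n_parse, n_ocr = res.x
```

Here the bottleneck is the parse stage (8 replicas × 50 docs/s = 400 docs/s), so the solver caps throughput there and allocates just enough OCR replicas to keep up.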
- Feedback Loop – Once the new schedule is active, Trident invalidates any stale GP samples (because the environment changed) and starts collecting fresh observations, keeping the model up‑to‑date.
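A minimal sketch of the feedback loop's sample invalidation, assuming a simple bounded store of (parallelism, throughput) observations; the class and its API are hypothetical, not the paper's implementation:

```python
from collections import deque

class ObservationStore:
    # Keeps recent (parallelism, throughput) samples; a regime change
    # flushes stale samples so the surrogate retrains on fresh data only.
    def __init__(self, max_samples=100):
        self.samples = deque(maxlen=max_samples)

    def record(self, parallelism, throughput):
        self.samples.append((parallelism, throughput))

    def on_regime_change(self):
        # The environment shifted (new schedule or new workload): old
        # samples no longer describe current behavior, so drop them.
        self.samples.clear()

store = ObservationStore()
for p, t in [(2, 190.0), (4, 350.0), (6, 460.0)]:
    store.record(p, t)
store.on_regime_change()           # schedule update applied
store.record(4, 280.0)             # fresh observation under the new regime
```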
All of this runs online with sub‑second overhead, making it suitable for production services that cannot afford long re‑optimization pauses.
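The rolling-update trade-off weighed above (cold-start cost vs. expected gain) reduces to a simple amortization check. This sketch assumes throughput in items/s and, pessimistically, zero output during warm-up; the function and its inputs are illustrative, not taken from the paper:

```python
def should_switch(cur_thr, new_thr, cold_start_s, horizon_s):
    # Switch only if the extra items produced over the planning horizon
    # outweigh the items lost while replicas reload during the cold start.
    gain = (new_thr - cur_thr) * horizon_s
    loss = cur_thr * cold_start_s  # pessimistic: no output while warming up
    return gain > loss

# Big gain, short reload: worth switching.
should_switch(cur_thr=100, new_thr=150, cold_start_s=10, horizon_s=60)
# Marginal gain, long model reload: stay on the current schedule.
should_switch(cur_thr=100, new_thr=105, cold_start_s=30, horizon_s=60)
```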
Results & Findings
| Pipeline | Baseline (static) | Trident (adaptive) | Speed‑up | Memory safety |
|---|---|---|---|---|
| PDF document curation (CPU‑heavy preprocessing + GPU OCR) | 120 docs/s | 242 docs/s | 2.01× | No OOM incidents |
| Video curation (decode → frame‑level model → metadata) | 45 clips/s | 85 clips/s | 1.88× | No OOM incidents |
| Overhead | – | < 5 % of total runtime | – | – |
Key observations
- Throughput gains are highest when the workload exhibits frequent regime shifts (e.g., mixed‑size PDFs). The adaptive loop quickly ramps up parallelism for heavy batches and scales down for lighter ones, keeping GPU utilization near 90 %.
- Memory‑aware optimization eliminates OOM crashes that plagued the static baseline during peak memory spikes (e.g., processing a high‑resolution video).
- The MILP solves in < 200 ms on a 32‑core control node, meaning schedule updates can happen multiple times per minute without hurting latency.
Practical Implications
- For AI platform engineers: Trident can be dropped into existing Ray Data pipelines (or similar data‑flow frameworks) to automatically extract more performance from the same hardware, reducing cloud spend.
- For ML Ops teams: The memory‑safe Bayesian optimizer removes the need for manual “max‑batch‑size” tuning, a common source of production incidents.
- For developers building multimodal services: You can now mix CPU‑bound preprocessing (e.g., PDF parsing, video decoding) with accelerator‑backed inference without hand‑crafting per‑operator scaling rules. Trident’s rolling updates keep latency stable during re‑configuration, which is critical for SLAs.
- For cloud providers: The approach demonstrates that smarter scheduling can double throughput on existing clusters, potentially delaying the need for costly hardware upgrades.
Limitations & Future Work
- Fixed‑resource assumption – Trident optimizes within a static cluster; it does not currently trigger horizontal scaling (adding/removing nodes). Extending the loop to include autoscaling decisions would broaden applicability.
- Modeling overhead – Gaussian Process regression scales cubically with the number of observations; the current implementation uses a sliding window to keep the dataset small, which may discard long‑term trends. More scalable surrogate models (e.g., deep kernel learning) could improve accuracy.
- Operator granularity – The framework assumes operators expose throughput and memory metrics. Black‑box stages (e.g., third‑party services) need instrumentation or proxy wrappers.
- Generalization beyond Ray Data – While the concepts are portable, integrating Trident with other orchestration systems (Kubernetes, Dask) will require adapters for their scheduling APIs.
Future research directions include multi‑cluster coordination, integration with cost‑aware cloud billing APIs, and exploring reinforcement‑learning‑based schedulers that can learn from longer deployment histories.
Authors
- Ding Pan
- Zhuangzhuang Zhou
- Long Qian
- Binhang Yuan
Paper Information
- arXiv ID: 2603.02075v1
- Categories: cs.DC
- Published: March 2, 2026