[Paper] PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping

Published: February 9, 2026 at 09:50 AM EST
4 min read
Source: arXiv - 2602.08747v1

Overview

Modern AI services often stitch together several deep‑neural‑network (DNN) models into an inference pipeline that must return results within tight latency budgets. When too many requests pile up, many miss their deadlines and eventually time out. Existing systems react by dropping requests only after a timeout is imminent, which wastes compute already spent on doomed requests and still leaves many others unfinished. The paper PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping proposes a proactive dropping strategy that decides when and which requests to drop before they become a problem, dramatically improving the amount of useful work (goodput) the system delivers.

Key Contributions

  • Proactive Dropping Framework (PARD): Introduces a runtime‑aware controller that predicts overload early and drops requests pre‑emptively.
  • Adaptive Priority Scheduler: Dynamically assigns priorities to in‑flight requests based on their remaining latency budget and current workload intensity, ensuring the “right” requests survive.
  • Comprehensive Evaluation: Shows on a 64‑GPU cluster that PARD boosts goodput by 16 %–176 % over the best prior systems while cutting the overall drop rate and wasted GPU cycles by up to 17× and 62×, respectively.
  • Generalizable Design: Works with any multi‑model inference pipeline and does not require changes to the underlying DNN models.

Methodology

  1. Runtime Monitoring: PARD continuously measures queue lengths, per‑stage processing times, and the remaining latency budget for each request as it traverses the pipeline.
  2. Predictive Overload Detection: Using these metrics, a lightweight controller estimates whether the pipeline will miss deadlines in the near future.
  3. When‑to‑Drop Decision: If an overload is predicted, the controller triggers a dropping window—a short interval during which some requests will be culled.
  4. Which‑to‑Drop Selection: Each in‑flight request receives a priority score:
    • Higher priority → larger remaining latency budget, lower computational cost, or belonging to a high‑value service tier.
    • Lower priority → tight deadline, heavy compute, or low‑value tier.
      The controller drops the lowest‑priority requests first, freeing resources for the rest.
  5. Feedback Loop: After each dropping window, the system re‑evaluates the workload and adjusts the aggressiveness of future dropping (e.g., widening or narrowing the window).
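The five steps above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the priority formula, the earliest‑deadline‑first backlog check, and all names (`Request`, `priority`, `predicts_misses`, `proactive_drop`) are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    deadline: float      # absolute deadline (same clock as `now`, seconds)
    est_cost: float      # estimated remaining compute time for this request (s)
    tier_weight: float   # relative value of the request's service tier

def priority(req: Request, now: float) -> float:
    """Higher = more worth keeping: large remaining slack, cheap compute,
    high-value tier (the score combines all three, as the post describes)."""
    slack = req.deadline - now - req.est_cost
    return req.tier_weight * slack / max(req.est_cost, 1e-6)

def predicts_misses(queue, now: float) -> bool:
    """Lightweight overload check: simulate serving in deadline order and
    see whether the accumulated backlog pushes any request past its deadline."""
    backlog = now
    for req in sorted(queue, key=lambda r: r.deadline):
        backlog += req.est_cost
        if backlog > req.deadline:
            return True
    return False

def proactive_drop(queue, now: float):
    """Dropping window: while overload is still predicted, cull the
    lowest-priority request, freeing capacity for the survivors."""
    survivors = sorted(queue, key=lambda r: priority(r, now), reverse=True)
    while survivors and predicts_misses(survivors, now):
        survivors.pop()   # drop the current lowest-priority request
    return survivors
```

For example, three 0.4 s requests sharing a 1.0 s deadline cannot all finish; the loop culls the low‑tier one early instead of letting it time out after consuming GPU cycles.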

The whole pipeline remains unchanged; PARD sits as a thin orchestration layer that can be deployed on existing inference serving stacks (e.g., NVIDIA Triton Inference Server, formerly TensorRT Inference Server).

Results & Findings

| Metric | Baseline (reactive dropping) | PARD |
| --- | --- | --- |
| Goodput (useful requests per second) | 1.0× (reference) | 1.16×–2.76× |
| Overall drop rate | 12% | 6%–7.5% |
| Wasted GPU compute (cycles spent on dropped requests) | 1.0× | 0.06×–0.63× |
| Latency‑budget miss probability | 8% | <2% |
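Goodput differs from raw throughput in that it counts only requests completed within their latency budget. A toy calculation (the request rates here are hypothetical, only the 1.16× factor comes from the paper) shows how the two diverge:

```python
# Hypothetical numbers illustrating goodput vs. raw throughput.
completed_per_s = 100        # raw throughput: all completions, on time or not
missed_deadline_per_s = 12   # completions that arrived too late (wasted work)

# Goodput counts only on-time completions.
baseline_goodput = completed_per_s - missed_deadline_per_s   # 88 useful req/s

# Applying the paper's lower-bound improvement factor of 1.16x:
pard_goodput = baseline_goodput * 1.16   # ~102 useful req/s from the same fleet
```

The same GPU fleet delivers more useful answers per second simply because fewer cycles are spent on requests that were always going to miss their deadline.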

Key takeaways:

  • Early dropping prevents the pipeline from becoming saturated, keeping the queue shallow and reducing tail latency.
  • Priority‑aware selection ensures that high‑value or low‑cost requests survive, which directly translates into higher goodput.
  • The system scales: on a 64‑GPU cluster handling realistic workloads (image classification, object detection, recommendation), the gains hold across different model depths and batch sizes.

Practical Implications

  • For Cloud AI Providers: PARD can be integrated into inference serving platforms to squeeze more revenue out of existing GPU fleets without adding hardware.
  • Edge & On‑Device Deployments: Devices with limited compute (e.g., autonomous drones, AR glasses) can use proactive dropping to guarantee real‑time responses while conserving battery.
  • SLA‑Aware Services: SaaS products that promise sub‑100 ms latency can adopt PARD to meet SLAs more reliably, reducing penalty costs.
  • Developer Tooling: The priority API is simple (set budget, weight, tier) and can be exposed in SDKs, letting developers fine‑tune which requests are “mission‑critical.”
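A priority API of the kind described (budget, weight, tier) might look like the following sketch. Every name here (`RequestPolicy`, `submit`, the tier strings) is hypothetical, invented for illustration; the paper does not specify an SDK surface.

```python
from dataclasses import dataclass

@dataclass
class RequestPolicy:
    """Hypothetical per-request priority knobs: a latency budget,
    a priority weight, and a service tier."""
    budget_ms: int          # latency budget for this request
    weight: float = 1.0     # relative importance within its tier
    tier: str = "standard"  # e.g. "standard" | "premium" | "mission-critical"

def submit(pipeline, payload, policy: RequestPolicy):
    """Sketch of an SDK entry point: attach the policy as metadata so the
    scheduler can rank this request during a dropping window."""
    return pipeline.enqueue(payload, metadata={
        "budget_ms": policy.budget_ms,
        "weight": policy.weight,
        "tier": policy.tier,
    })

# Usage: flag a request as mission-critical with a tight budget, e.g.
#   submit(pipeline, frame, RequestPolicy(budget_ms=80, weight=2.0,
#                                         tier="mission-critical"))
```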

Overall, PARD shifts the mindset from “drop only when you must” to “drop smartly before you must,” a change that can be realized with minimal code changes and measurable ROI.

Limitations & Future Work

  • Prediction Accuracy: The overload estimator relies on short‑term statistics; sudden spikes (e.g., flash crowds) may still cause occasional deadline misses.
  • Priority Configuration Overhead: Determining optimal priority weights for heterogeneous services can be non‑trivial; automated tuning is left for future research.
  • Model‑Specific Optimizations: PARD treats all stages uniformly; deeper integration with model‑level profiling could further improve decisions.
  • Extending Beyond GPU Clusters: The authors plan to explore how the approach works on heterogeneous hardware (TPUs, FPGAs) and in multi‑tenant environments where resource isolation adds complexity.

Authors

  • Zhixin Zhao
  • Yitao Hu
  • Simin Chen
  • Mingfang Ji
  • Wei Yang
  • Yuhao Zhang
  • Laiping Zhao
  • Wenxin Li
  • Xiulong Liu
  • Wenyu Qu
  • Hao Wang

Paper Information

  • arXiv ID: 2602.08747v1
  • Categories: cs.DC
  • Published: February 9, 2026
