[Paper] PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping

Published: February 9, 2026 at 09:50 AM EST
4 min read
Source: arXiv - 2602.08747v1

Overview

Modern AI services often stitch together several deep‑neural‑network (DNN) models into an inference pipeline that must return results within tight latency budgets. When too many requests pile up, many miss their deadlines and eventually time out. Existing systems react by dropping requests only after a timeout is imminent, which wastes compute already spent on doomed requests and still leaves many others unfinished. The paper PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping proposes a proactive dropping strategy that decides when and which requests to drop before they become a problem, dramatically improving the amount of useful work (goodput) the system delivers.

Key Contributions

  • Proactive Dropping Framework (PARD): Introduces a runtime‑aware controller that predicts overload early and drops requests pre‑emptively.
  • Adaptive Priority Scheduler: Dynamically assigns priorities to in‑flight requests based on their remaining latency budget and current workload intensity, ensuring the “right” requests survive.
  • Comprehensive Evaluation: Shows on a 64‑GPU cluster that PARD boosts goodput by 16 %–176 % over the best prior systems while cutting the overall drop rate and wasted GPU cycles by up to 17× and 62×, respectively.
  • Generalizable Design: Works with any multi‑model inference pipeline and does not require changes to the underlying DNN models.

Methodology

  1. Runtime Monitoring: PARD continuously measures queue lengths, per‑stage processing times, and the remaining latency budget for each request as it traverses the pipeline.
  2. Predictive Overload Detection: Using these metrics, a lightweight controller estimates whether the pipeline will miss deadlines in the near future.
  3. When‑to‑Drop Decision: If an overload is predicted, the controller triggers a dropping window—a short interval during which some requests will be culled.
  4. Which‑to‑Drop Selection: Each in‑flight request receives a priority score:
    • Higher priority → larger remaining latency budget, lower computational cost, or belonging to a high‑value service tier.
    • Lower priority → tight deadline, heavy compute, or low‑value tier.
      The controller drops the lowest‑priority requests first, freeing resources for the rest.
  5. Feedback Loop: After each dropping window, the system re‑evaluates the workload and adjusts the aggressiveness of future dropping (e.g., widening or narrowing the window).
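The five steps above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the priority formula, the earliest‑deadline‑first backlog check, and all names (`Request`, `priority`, `predicts_misses`, `proactive_drop`) are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    deadline: float      # absolute deadline (same clock as `now`, seconds)
    est_cost: float      # estimated remaining compute time for this request (s)
    tier_weight: float   # relative value of the request's service tier

def priority(req: Request, now: float) -> float:
    """Higher = more worth keeping: large remaining slack, cheap compute,
    high-value tier (the score combines all three, as the post describes)."""
    slack = req.deadline - now - req.est_cost
    return req.tier_weight * slack / max(req.est_cost, 1e-6)

def predicts_misses(queue, now: float) -> bool:
    """Lightweight overload check: simulate serving in deadline order and
    see whether the accumulated backlog pushes any request past its deadline."""
    backlog = now
    for req in sorted(queue, key=lambda r: r.deadline):
        backlog += req.est_cost
        if backlog > req.deadline:
            return True
    return False

def proactive_drop(queue, now: float):
    """Dropping window: while overload is still predicted, cull the
    lowest-priority request, freeing capacity for the survivors."""
    survivors = sorted(queue, key=lambda r: priority(r, now), reverse=True)
    while survivors and predicts_misses(survivors, now):
        survivors.pop()   # drop the current lowest-priority request
    return survivors
```

For example, three 0.4 s requests sharing a 1.0 s deadline cannot all finish; the loop culls the low‑tier one early instead of letting it time out after consuming GPU cycles.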

The whole pipeline remains unchanged; PARD sits as a thin orchestration layer that can be deployed on existing inference serving stacks (e.g., NVIDIA Triton Inference Server, formerly TensorRT Inference Server).

Results & Findings

| Metric | Baseline (reactive dropping) | PARD |
| --- | --- | --- |
| Goodput (useful requests per second) | 1.0× (reference) | 1.16×–2.76× |
| Overall drop rate | 12% | 6%–7.5% |
| Wasted GPU compute (cycles spent on dropped requests) | 1.0× | 0.06×–0.63× |
| Latency‑budget miss probability | 8% | <2% |
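Goodput differs from raw throughput in that it counts only requests completed within their latency budget. A toy calculation (the request rates here are hypothetical, only the 1.16× factor comes from the paper) shows how the two diverge:

```python
# Hypothetical numbers illustrating goodput vs. raw throughput.
completed_per_s = 100        # raw throughput: all completions, on time or not
missed_deadline_per_s = 12   # completions that arrived too late (wasted work)

# Goodput counts only on-time completions.
baseline_goodput = completed_per_s - missed_deadline_per_s   # 88 useful req/s

# Applying the paper's lower-bound improvement factor of 1.16x:
pard_goodput = baseline_goodput * 1.16   # ~102 useful req/s from the same fleet
```

The same GPU fleet delivers more useful answers per second simply because fewer cycles are spent on requests that were always going to miss their deadline.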

Key takeaways:

  • Early dropping prevents the pipeline from becoming saturated, keeping the queue shallow and reducing tail latency.
  • Priority‑aware selection ensures that high‑value or low‑cost requests survive, which directly translates into higher goodput.
  • The system scales: on a 64‑GPU cluster handling realistic workloads (image classification, object detection, recommendation), the gains hold across different model depths and batch sizes.

Practical Implications

  • For Cloud AI Providers: PARD can be integrated into inference serving platforms to squeeze more revenue out of existing GPU fleets without adding hardware.
  • Edge & On‑Device Deployments: Devices with limited compute (e.g., autonomous drones, AR glasses) can use proactive dropping to guarantee real‑time responses while conserving battery.
  • SLA‑Aware Services: SaaS products that promise sub‑100 ms latency can adopt PARD to meet SLAs more reliably, reducing penalty costs.
  • Developer Tooling: The priority API is simple (set budget, weight, tier) and can be exposed in SDKs, letting developers fine‑tune which requests are “mission‑critical.”
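A priority API of the kind described (budget, weight, tier) might look like the following sketch. Every name here (`RequestPolicy`, `submit`, the tier strings) is hypothetical, invented for illustration; the paper does not specify an SDK surface.

```python
from dataclasses import dataclass

@dataclass
class RequestPolicy:
    """Hypothetical per-request priority knobs: a latency budget,
    a priority weight, and a service tier."""
    budget_ms: int          # latency budget for this request
    weight: float = 1.0     # relative importance within its tier
    tier: str = "standard"  # e.g. "standard" | "premium" | "mission-critical"

def submit(pipeline, payload, policy: RequestPolicy):
    """Sketch of an SDK entry point: attach the policy as metadata so the
    scheduler can rank this request during a dropping window."""
    return pipeline.enqueue(payload, metadata={
        "budget_ms": policy.budget_ms,
        "weight": policy.weight,
        "tier": policy.tier,
    })

# Usage: flag a request as mission-critical with a tight budget, e.g.
#   submit(pipeline, frame, RequestPolicy(budget_ms=80, weight=2.0,
#                                         tier="mission-critical"))
```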

Overall, PARD shifts the mindset from “drop only when you must” to “drop smartly before you must,” a change that can be realized with minimal code changes and measurable ROI.

Limitations & Future Work

  • Prediction Accuracy: The overload estimator relies on short‑term statistics; sudden spikes (e.g., flash crowds) may still cause occasional deadline misses.
  • Priority Configuration Overhead: Determining optimal priority weights for heterogeneous services can be non‑trivial; automated tuning is left for future research.
  • Model‑Specific Optimizations: PARD treats all stages uniformly; deeper integration with model‑level profiling could further improve decisions.
  • Extending Beyond GPU Clusters: The authors plan to explore how the approach works on heterogeneous hardware (TPUs, FPGAs) and in multi‑tenant environments where resource isolation adds complexity.

Authors

  • Zhixin Zhao
  • Yitao Hu
  • Simin Chen
  • Mingfang Ji
  • Wei Yang
  • Yuhao Zhang
  • Laiping Zhao
  • Wenxin Li
  • Xiulong Liu
  • Wenyu Qu
  • Hao Wang

Paper Information

  • arXiv ID: 2602.08747v1
  • Categories: cs.DC
  • Published: February 9, 2026
