[Paper] Dynamic Rebatching for Efficient Early-Exit Inference with DREX

Published: December 17, 2025 at 01:55 PM EST
4 min read
Source: arXiv - 2512.15705v1

Overview

Early‑Exit (EE) techniques let large language models (LLMs) skip unnecessary layers for “easy” tokens, cutting inference latency. The paper Dynamic Rebatching for Efficient Early‑Exit Inference with DREX shows that conventional batching pipelines waste this opportunity because they treat a batch as a monolith: either all requests exit together or none do. The authors introduce Dynamic Rebatching, a runtime technique that reshapes the batch on the fly so each request can exit at its optimal layer without sacrificing throughput or quality.

Key Contributions

  • Dynamic Rebatching concept – Re‑organizes the batch at every early‑exit checkpoint, immediately finalizing tokens that satisfy the exit condition while buffering the rest for deeper processing.
  • DREX system – A production‑ready inference engine that implements dynamic rebatching with two performance‑boosting tricks:
    1. Copy‑free rebatching buffer – avoids costly memory copies when reshuffling requests.
    2. EE‑ and SLA‑aware scheduler – analytically predicts whether a rebatching step will improve overall latency/throughput, preventing harmful reorganizations.
  • Efficient KV‑cache handling – Provides a memory‑lightweight method to reconstruct missing key‑value cache entries for layers that were skipped, preserving the speed benefits of transformer caching.
  • Guarantee of no involuntary exits – DREX ensures that a request never exits before the EE model’s own confidence threshold is satisfied, protecting output quality.
  • Empirical validation – Demonstrates 2–12 % higher throughput over prior EE batching baselines while keeping the same generation quality.

Methodology

  1. Early‑Exit checkpoints – The model is instrumented with several “exit heads” after selected transformer layers. Each head produces a confidence score; if it exceeds a pre‑tuned threshold, the token can be emitted early.
  2. Dynamic rebatching loop (this loop and the copy‑free buffer are sketched in the first code example after this list):
    • At an exit point, the runtime scans the current batch.
    • Tokens that meet the confidence criterion are finalized and removed from further processing.
    • The remaining tokens are placed into a rebatching buffer that tracks their original positions but does not copy the underlying tensor data.
    • The buffer groups the pending tokens into a new batch (potentially of a different size) and forwards it to the next deeper layer.
  3. Copy‑free buffer design – Uses index‑based indirection (e.g., a vector of pointers/offsets) so the same underlying activation memory can be reused across rebatching steps, eliminating O(N) data movement.
  4. Scheduler analytics – For each potential rebatch, DREX estimates the cost/benefit trade‑off (extra compute vs. saved latency) using a lightweight model of layer latency, batch‑size scaling, and SLA constraints (e.g., max per‑token latency). If the predicted profit is negative, the scheduler postpones rebatching and keeps the current batch intact. (A toy version of this profit estimate appears after this list.)
  5. KV‑cache reconstruction – When a token skips a layer, DREX synthesizes the missing key‑value entries by copying the nearest cached state and applying a cheap linear projection, keeping the cache size bounded. (A sketch of this backfill also appears after this list.)
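
To make steps 1–3 concrete, here is a minimal, runnable Python sketch of one decoding step with confidence‑based exits and an index‑based (copy‑free) rebatching buffer. Everything here is illustrative: the names (`CONF_THRESHOLD`, `EXIT_LAYERS`, `run_layer`, and so on) are hypothetical, the “activations” are plain floats, and the exit heads are random stubs; the real DREX engine operates on contiguous GPU tensors with trained exit heads.

```python
import random

CONF_THRESHOLD = 0.9     # assumed pre-tuned exit-head threshold
NUM_LAYERS = 12          # assumed model depth
EXIT_LAYERS = {3, 6, 9}  # layers instrumented with exit heads (assumed)

def run_layer(activations, layer):
    """Stub for one transformer layer applied to the active tokens."""
    return [a + 1.0 for a in activations]

def exit_confidence(activation, layer):
    """Stub for an exit head's confidence score at this layer."""
    return random.random()

def decode_step(batch_activations):
    """One decoding step over a batch, with dynamic rebatching."""
    # Copy-free indirection: track the *indices* of still-active requests
    # rather than physically compacting the activation storage.
    active_idx = list(range(len(batch_activations)))
    finalized = {}  # request index -> (exit layer, final activation)

    for layer in range(NUM_LAYERS):
        # Run the layer only over the logically active slice.
        outs = run_layer([batch_activations[i] for i in active_idx], layer)
        for i, a in zip(active_idx, outs):
            batch_activations[i] = a

        if layer in EXIT_LAYERS:
            # Dynamic rebatching: finalize confident tokens, keep the rest.
            still_active = []
            for i in active_idx:
                if exit_confidence(batch_activations[i], layer) >= CONF_THRESHOLD:
                    finalized[i] = (layer, batch_activations[i])
                else:
                    still_active.append(i)
            # The new, smaller batch is just a shorter index list --
            # no activation tensors are moved or copied.
            active_idx = still_active
            if not active_idx:
                break

    # Requests that never reached the threshold run the full stack.
    for i in active_idx:
        finalized[i] = (NUM_LAYERS - 1, batch_activations[i])
    return finalized

print(decode_step([0.0] * 8))
```

Note that a token leaves only when its own confidence clears the threshold at an instrumented layer, matching the paper’s guarantee that exits are never forced just to keep the batch uniform.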
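
The paper describes the scheduler’s analytics only at the level of a lightweight latency model plus SLA constraints; the toy profit estimate below is one plausible reading, not the authors’ actual formula. The latency model (a fixed per‑layer launch cost plus a per‑token term) and all constants are assumptions.

```python
def layer_latency_ms(batch_size, per_token_ms=0.05, fixed_ms=1.0):
    """Assumed latency model: fixed per-layer launch cost plus a
    per-token term capturing batch-size scaling."""
    return fixed_ms + per_token_ms * batch_size

def rebatch_profit_ms(batch_size, num_exiting, remaining_layers,
                      rebatch_overhead_ms=0.2):
    """Latency saved over the remaining layers by shrinking the batch
    now, minus the (assumed) cost of the reorganization itself."""
    keep = layer_latency_ms(batch_size) * remaining_layers
    shrink = layer_latency_ms(batch_size - num_exiting) * remaining_layers
    return (keep - shrink) - rebatch_overhead_ms

def should_rebatch(batch_size, num_exiting, remaining_layers,
                   slack_to_sla_ms=float("inf")):
    """Rebatch when the predicted profit is positive, or when some
    request is about to violate its latency SLA (assumed policy)."""
    profit = rebatch_profit_ms(batch_size, num_exiting, remaining_layers)
    return profit > 0 or slack_to_sla_ms < 1.0

# Shrinking by one token late in the stack is not worth the overhead,
# but releasing half the batch with many layers left clearly is.
print(should_rebatch(batch_size=64, num_exiting=1, remaining_layers=2))   # False
print(should_rebatch(batch_size=64, num_exiting=32, remaining_layers=8))  # True
```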
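
Step 5 is described at a high level (copy the nearest cached state, apply a cheap linear projection); the sketch below shows that shape on scalar stand‑ins. The scalar `proj` factor is a placeholder for whatever learned or fixed projection the system actually uses.

```python
def reconstruct_kv(kv_cache, skipped_layers, proj=0.9):
    """Backfill KV-cache entries for layers a token skipped at exit.

    kv_cache: {layer: (key, value)} for layers the token actually ran;
    keys/values are scalars here but would be tensors in practice.
    """
    computed = sorted(kv_cache)
    for layer in sorted(skipped_layers):
        # Nearest layer below this one that was actually computed.
        nearest = max(l for l in computed if l < layer)
        k, v = kv_cache[nearest]
        # Cheap projection instead of running the skipped layer itself,
        # keeping the memory and compute for reconstruction bounded.
        kv_cache[layer] = (k * proj, v * proj)
    return kv_cache

# A token that exited after layer 2 of a 6-layer model:
cache = {0: (1.0, 2.0), 1: (1.5, 2.5), 2: (2.0, 3.0)}
print(reconstruct_kv(cache, skipped_layers=[3, 4, 5]))
```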

Results & Findings

| Metric | Baseline EE batching | DREX (dynamic rebatching) |
|---|---|---|
| Throughput (tokens/s) | 1.00× (reference) | 1.02–1.12× |
| Average per‑token latency | 120 ms | 108–115 ms |
| Involuntary exit rate | 5–12 % of tokens | 0 % |
| Output quality (BLEU / ROUGE) | baseline | identical (no degradation) |

  • Throughput gains grow with larger batch sizes and deeper exit points because DREX can keep the GPU busy with a well‑filled batch while still letting early‑exit tokens leave promptly.
  • The zero‑involuntary‑exit guarantee eliminates a subtle quality bug that plagued earlier EE systems, where a token would be forced out early just to keep the batch uniform.
  • Scheduler accuracy: The analytical profit model correctly predicts profitable rebatches >95 % of the time, avoiding unnecessary reshuffles that would otherwise add overhead.

Practical Implications

  • Deployers of LLM APIs can integrate DREX to reduce cloud‑GPU costs while preserving the quality guarantees promised by early‑exit models.
  • Latency‑sensitive applications (e.g., real‑time code completion, conversational agents) benefit from lower tail latency because easy tokens finish instantly rather than waiting for the slowest request in the batch.
  • Framework developers (PyTorch, TensorFlow, Triton) can adopt the copy‑free rebatching buffer pattern to support dynamic batch sizes without incurring memory copy penalties.
  • SLA‑aware scheduling opens the door to hybrid workloads where some requests have strict latency caps while others prioritize throughput; DREX can automatically balance the two.
  • KV‑cache reconstruction shows a practical way to keep transformer caching effective even when layers are skipped, a pattern that could be reused for other conditional‑execution models (e.g., Mixture‑of‑Experts).

Limitations & Future Work

  • Model‑specific tuning – The exit thresholds and scheduler cost model need calibration per model architecture and hardware; a one‑size‑fits‑all configuration is not provided.
  • GPU memory fragmentation – Although copy‑free, the indirection buffers can lead to scattered memory accesses, which may affect performance on GPUs with limited cache.
  • Scalability to massive batch sizes – The reported gains plateau beyond ~256‑token batches; further research is needed to maintain benefits at extreme scales.
  • Extension to multi‑node inference – DREX currently targets single‑node GPU setups; distributed scenarios would require coordinated rebatching across nodes.
  • Broader EE strategies – The paper focuses on confidence‑based exits; future work could explore hybrid criteria (e.g., token‑level difficulty estimators) and integrate with other conditional execution techniques like adaptive depth or mixture‑of‑experts.

Authors

  • Xuting Liu
  • Daniel Alexander
  • Siva Kesava Reddy Kakarla
  • Behnaz Arzani
  • Vincent Liu

Paper Information

  • arXiv ID: 2512.15705v1
  • Categories: cs.DC, cs.LG
  • Published: December 17, 2025