[Paper] Dynamic Rebatching for Efficient Early-Exit Inference with DREX

Published: December 17, 2025 at 01:55 PM EST
4 min read
Source: arXiv - 2512.15705v1

Overview

Early‑Exit (EE) techniques let large language models (LLMs) skip unnecessary layers for “easy” tokens, cutting inference latency. The paper Dynamic Rebatching for Efficient Early‑Exit Inference with DREX shows that conventional batching pipelines waste this opportunity because they treat a batch as a monolith: either all requests exit together or none do. The authors introduce Dynamic Rebatching, a runtime technique that reshapes the batch on the fly so each request can exit at its optimal layer without sacrificing throughput or quality.

Key Contributions

  • Dynamic Rebatching concept – Re‑organizes the batch at every early‑exit checkpoint, immediately finalizing tokens that satisfy the exit condition while buffering the rest for deeper processing.
  • DREX system – A production‑ready inference engine that implements dynamic rebatching with two performance‑boosting tricks:
    1. Copy‑free rebatching buffer – avoids costly memory copies when reshuffling requests.
    2. EE‑ and SLA‑aware scheduler – analytically predicts whether a rebatching step will improve overall latency/throughput, preventing harmful reorganizations.
  • Efficient KV‑cache handling – Provides a memory‑lightweight method to reconstruct missing key‑value cache entries for layers that were skipped, preserving the speed benefits of transformer caching.
  • Guarantee of no involuntary exits – DREX ensures that a request never exits before the EE model’s own confidence threshold is satisfied, protecting output quality.
  • Empirical validation – Demonstrates 2–12 % higher throughput over prior EE batching baselines while keeping the same generation quality.

Methodology

  1. Early‑Exit checkpoints – The model is instrumented with several “exit heads” after selected transformer layers. Each head produces a confidence score; if it exceeds a pre‑tuned threshold, the token can be emitted early.
  2. Dynamic rebatching loop (this loop and the copy‑free buffer are sketched in the first code example after this list):
    • At an exit point, the runtime scans the current batch.
    • Tokens that meet the confidence criterion are finalized and removed from further processing.
    • The remaining tokens are placed into a rebatching buffer that tracks their original positions but does not copy the underlying tensor data.
    • The buffer groups the pending tokens into a new batch (potentially of a different size) and forwards it to the next deeper layer.
  3. Copy‑free buffer design – Uses index‑based indirection (e.g., a vector of pointers/offsets) so the same underlying activation memory can be reused across rebatching steps, eliminating O(N) data movement.
  4. Scheduler analytics – For each potential rebatch, DREX estimates the cost/benefit trade‑off (extra compute vs. saved latency) using a lightweight model of layer latency, batch‑size scaling, and SLA constraints (e.g., max per‑token latency). If the predicted profit is negative, the scheduler postpones rebatching and keeps the current batch intact. (A toy version of this profit estimate appears after this list.)
  5. KV‑cache reconstruction – When a token skips a layer, DREX synthesizes the missing key‑value entries by copying the nearest cached state and applying a cheap linear projection, keeping the cache size bounded. (A sketch of this backfill also appears after this list.)
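
To make steps 1–3 concrete, here is a minimal, runnable Python sketch of one decoding step with confidence‑based exits and an index‑based (copy‑free) rebatching buffer. Everything here is illustrative: the names (`CONF_THRESHOLD`, `EXIT_LAYERS`, `run_layer`, and so on) are hypothetical, the “activations” are plain floats, and the exit heads are random stubs; the real DREX engine operates on contiguous GPU tensors with trained exit heads.

```python
import random

CONF_THRESHOLD = 0.9     # assumed pre-tuned exit-head threshold
NUM_LAYERS = 12          # assumed model depth
EXIT_LAYERS = {3, 6, 9}  # layers instrumented with exit heads (assumed)

def run_layer(activations, layer):
    """Stub for one transformer layer applied to the active tokens."""
    return [a + 1.0 for a in activations]

def exit_confidence(activation, layer):
    """Stub for an exit head's confidence score at this layer."""
    return random.random()

def decode_step(batch_activations):
    """One decoding step over a batch, with dynamic rebatching."""
    # Copy-free indirection: track the *indices* of still-active requests
    # rather than physically compacting the activation storage.
    active_idx = list(range(len(batch_activations)))
    finalized = {}  # request index -> (exit layer, final activation)

    for layer in range(NUM_LAYERS):
        # Run the layer only over the logically active slice.
        outs = run_layer([batch_activations[i] for i in active_idx], layer)
        for i, a in zip(active_idx, outs):
            batch_activations[i] = a

        if layer in EXIT_LAYERS:
            # Dynamic rebatching: finalize confident tokens, keep the rest.
            still_active = []
            for i in active_idx:
                if exit_confidence(batch_activations[i], layer) >= CONF_THRESHOLD:
                    finalized[i] = (layer, batch_activations[i])
                else:
                    still_active.append(i)
            # The new, smaller batch is just a shorter index list --
            # no activation tensors are moved or copied.
            active_idx = still_active
            if not active_idx:
                break

    # Requests that never reached the threshold run the full stack.
    for i in active_idx:
        finalized[i] = (NUM_LAYERS - 1, batch_activations[i])
    return finalized

print(decode_step([0.0] * 8))
```

Note that a token leaves only when its own confidence clears the threshold at an instrumented layer, matching the paper’s guarantee that exits are never forced just to keep the batch uniform.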
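
The paper describes the scheduler’s analytics only at the level of a lightweight latency model plus SLA constraints; the toy profit estimate below is one plausible reading, not the authors’ actual formula. The latency model (a fixed per‑layer launch cost plus a per‑token term) and all constants are assumptions.

```python
def layer_latency_ms(batch_size, per_token_ms=0.05, fixed_ms=1.0):
    """Assumed latency model: fixed per-layer launch cost plus a
    per-token term capturing batch-size scaling."""
    return fixed_ms + per_token_ms * batch_size

def rebatch_profit_ms(batch_size, num_exiting, remaining_layers,
                      rebatch_overhead_ms=0.2):
    """Latency saved over the remaining layers by shrinking the batch
    now, minus the (assumed) cost of the reorganization itself."""
    keep = layer_latency_ms(batch_size) * remaining_layers
    shrink = layer_latency_ms(batch_size - num_exiting) * remaining_layers
    return (keep - shrink) - rebatch_overhead_ms

def should_rebatch(batch_size, num_exiting, remaining_layers,
                   slack_to_sla_ms=float("inf")):
    """Rebatch when the predicted profit is positive, or when some
    request is about to violate its latency SLA (assumed policy)."""
    profit = rebatch_profit_ms(batch_size, num_exiting, remaining_layers)
    return profit > 0 or slack_to_sla_ms < 1.0

# Shrinking by one token late in the stack is not worth the overhead,
# but releasing half the batch with many layers left clearly is.
print(should_rebatch(batch_size=64, num_exiting=1, remaining_layers=2))   # False
print(should_rebatch(batch_size=64, num_exiting=32, remaining_layers=8))  # True
```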
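
Step 5 is described at a high level (copy the nearest cached state, apply a cheap linear projection); the sketch below shows that shape on scalar stand‑ins. The scalar `proj` factor is a placeholder for whatever learned or fixed projection the system actually uses.

```python
def reconstruct_kv(kv_cache, skipped_layers, proj=0.9):
    """Backfill KV-cache entries for layers a token skipped at exit.

    kv_cache: {layer: (key, value)} for layers the token actually ran;
    keys/values are scalars here but would be tensors in practice.
    """
    computed = sorted(kv_cache)
    for layer in sorted(skipped_layers):
        # Nearest layer below this one that was actually computed.
        nearest = max(l for l in computed if l < layer)
        k, v = kv_cache[nearest]
        # Cheap projection instead of running the skipped layer itself,
        # keeping the memory and compute for reconstruction bounded.
        kv_cache[layer] = (k * proj, v * proj)
    return kv_cache

# A token that exited after layer 2 of a 6-layer model:
cache = {0: (1.0, 2.0), 1: (1.5, 2.5), 2: (2.0, 3.0)}
print(reconstruct_kv(cache, skipped_layers=[3, 4, 5]))
```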

Results & Findings

| Metric | Baseline EE batching | DREX (dynamic rebatching) |
|---|---|---|
| Throughput (tokens/s) | 1.00× (reference) | 1.02–1.12× |
| Average per‑token latency | 120 ms | 108–115 ms |
| Involuntary exit rate | 5–12 % of tokens | 0 % |
| Output quality (BLEU / ROUGE) | baseline | identical (no degradation) |

  • Throughput gains grow with larger batch sizes and deeper exit points because DREX can keep the GPU busy with a well‑filled batch while still letting early‑exit tokens leave promptly.
  • The zero‑involuntary‑exit guarantee eliminates a subtle quality bug that plagued earlier EE systems, where a token would be forced out early just to keep the batch uniform.
  • Scheduler accuracy: The analytical profit model correctly predicts profitable rebatches >95 % of the time, avoiding unnecessary reshuffles that would otherwise add overhead.

Practical Implications

  • Deployers of LLM APIs can integrate DREX to reduce cloud‑GPU costs while preserving the quality guarantees promised by early‑exit models.
  • Latency‑sensitive applications (e.g., real‑time code completion, conversational agents) benefit from lower tail latency because easy tokens finish instantly rather than waiting for the slowest request in the batch.
  • Framework developers (PyTorch, TensorFlow, Triton) can adopt the copy‑free rebatching buffer pattern to support dynamic batch sizes without incurring memory copy penalties.
  • SLA‑aware scheduling opens the door to hybrid workloads where some requests have strict latency caps while others prioritize throughput; DREX can automatically balance the two.
  • KV‑cache reconstruction shows a practical way to keep transformer caching effective even when layers are skipped, a pattern that could be reused for other conditional‑execution models (e.g., Mixture‑of‑Experts).

Limitations & Future Work

  • Model‑specific tuning – The exit thresholds and scheduler cost model need calibration per model architecture and hardware; a one‑size‑fits‑all configuration is not provided.
  • GPU memory fragmentation – Although copy‑free, the indirection buffers can lead to scattered memory accesses, which may affect performance on GPUs with limited cache.
  • Scalability to massive batch sizes – The reported gains plateau beyond ~256‑token batches; further research is needed to maintain benefits at extreme scales.
  • Extension to multi‑node inference – DREX currently targets single‑node GPU setups; distributed scenarios would require coordinated rebatching across nodes.
  • Broader EE strategies – The paper focuses on confidence‑based exits; future work could explore hybrid criteria (e.g., token‑level difficulty estimators) and integrate with other conditional execution techniques like adaptive depth or mixture‑of‑experts.

Authors

  • Xuting Liu
  • Daniel Alexander
  • Siva Kesava Reddy Kakarla
  • Behnaz Arzani
  • Vincent Liu

Paper Information

  • arXiv ID: 2512.15705v1
  • Categories: cs.DC, cs.LG
  • Published: December 17, 2025