[Paper] A Granularity Characterization of Task Scheduling Effectiveness
Source: arXiv - 2602.20561v1
Overview
Task‑based runtimes (e.g., OpenMP tasks, Intel TBB, Kokkos) are praised for their ability to automatically balance work across many cores, but their strong‑scaling performance can collapse dramatically as the number of cores grows. This paper introduces a granularity characterization framework that ties the growth of scheduling overhead directly to the dependency topology of the task graph, showing why some algorithms stay efficient while others fall apart when parallelism increases.
Key Contributions
- Granularity metric linked to graph topology – a simple, computable measure that predicts when scheduling overhead will be amortized versus when it will dominate execution time.
- Theoretical model connecting dependency structure (e.g., depth, fan‑in/out) to the scaling of runtime overhead, independent of raw problem size.
- Empirical validation on a suite of representative scientific kernels (stencil, N‑body, linear algebra) demonstrating both gradual and abrupt strong‑scaling breakdowns.
- Practical decision rule for runtime systems to automatically choose between dynamic (task‑based) and static (loop‑based) scheduling without exhaustive tuning.
- Predictive overhead models that can estimate the strong‑scaling limit of a given application from its task‑graph characteristics alone.
Methodology
- Task‑graph abstraction – The authors model each application as a directed acyclic graph (DAG) whose nodes are tasks and whose edges are data dependencies.
- Granularity measure – They define a scalar G = (average task work) / (average scheduling cost per dependency). The numerator is estimated from the computational intensity of a task; the denominator is derived from the number of incoming/outgoing edges (i.e., the amount of bookkeeping the runtime must perform).
- Analytical scaling model – Using G, they derive a formula for the expected parallel speedup:
  \[ S(p) \approx \frac{p}{1 + \frac{1}{G}\cdot f(\text{topology},\,p)} \]
  where f captures how the number of active dependencies grows with the core count p.
- Experimental suite – They implement several kernels in both dynamic‑task and static‑loop variants, instrumenting the runtime to capture scheduling time, task execution time, and dependency counts.
- Validation – Measured speedups are compared against the model's predictions across a range of core counts (up to 256 cores on a Xeon‑based cluster).
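The granularity measure and the speedup formula above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Task` record, the way per‑task work is supplied, and the `f_topology` callback are all assumed here for concreteness.

```python
from dataclasses import dataclass

@dataclass
class Task:
    work: float     # estimated computational work of the task (e.g., flop count)
    in_edges: int   # number of incoming dependencies in the DAG
    out_edges: int  # number of outgoing dependencies in the DAG

def granularity(tasks, sched_cost_per_dep):
    """G = (average task work) / (average scheduling cost per dependency)."""
    avg_work = sum(t.work for t in tasks) / len(tasks)
    avg_deps = sum(t.in_edges + t.out_edges for t in tasks) / len(tasks)
    return avg_work / (sched_cost_per_dep * avg_deps)

def predicted_speedup(p, g, f_topology):
    """S(p) ≈ p / (1 + (1/G) * f(topology, p))."""
    return p / (1.0 + (1.0 / g) * f_topology(p))

# Example: a small chain of equal tasks, with f(p) growing linearly in p
# (a stand-in for a wide fan-out topology).
chain = [Task(work=100.0, in_edges=1, out_edges=1) for _ in range(4)]
g = granularity(chain, sched_cost_per_dep=1.0)      # 100 / (1.0 * 2) = 50
s8 = predicted_speedup(8, g, f_topology=lambda p: p)
```

The choice of `f_topology` is the model's key degree of freedom: a deep chain and a wide fan‑out yield different growth functions, which is exactly how the model separates the two scaling behaviours reported in the paper.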
Results & Findings
- Dependency topology dominates scaling: Two kernels with identical total work but different DAG shapes (e.g., a deep chain vs. a wide fan‑out) exhibit vastly different scaling curves.
- Granularity threshold: When G > 10 (i.e., task work is at least ten times larger than the per‑dependency scheduling cost), dynamic scheduling remains beneficial up to >200 cores. Below this threshold, performance drops sharply.
- Predictive accuracy: The analytical model predicts the observed strong‑scaling break‑point within ±5 % for all tested workloads.
- Decision rule success: Applying the runtime rule ("switch to static scheduling if G < 5") yields up to 2.3× speedup on workloads that would otherwise suffer from scheduling overhead.
- Gradual vs. abrupt breakdowns: Wide fan‑out graphs show a smooth performance decay, while deep‑chain graphs experience an abrupt collapse once the number of active tasks exceeds the core count.
Practical Implications
- Runtime developers can embed the granularity estimator to automatically toggle between task‑based and loop‑based execution, eliminating the need for manual tuning per application.
- Performance engineers gain a concrete diagnostic: by extracting the task graph (many modern runtimes already expose this), they can compute G early in the development cycle and decide whether to refactor the algorithm (e.g., coarsen tasks) or stick with static parallel loops.
- Library authors (e.g., for linear algebra or mesh‑based solvers) can expose a "granularity hint" to end users, allowing the library to adapt its internal scheduling strategy on the fly.
- Cloud/heterogeneous environments: The model can be extended to account for varying core speeds or accelerator off‑loading, helping orchestrators decide when to keep work on the CPU vs. dispatch as fine‑grained tasks to GPUs.
- Reduced testing overhead: Teams no longer need exhaustive strong‑scaling sweeps for each new kernel; a single DAG analysis suffices to predict scaling limits.
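A "granularity hint" entry point of the kind described above might look like the following sketch. The API, the threshold, and the use of a thread pool as the dynamic backend are all assumptions for illustration; a real library would dispatch to its own static and task‑based code paths.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(func, chunks, granularity_hint, threshold=5.0):
    """Hypothetical library entry point: the caller supplies a granularity
    hint, and the library picks its scheduling strategy from it instead of
    requiring per-application tuning."""
    if granularity_hint < threshold:
        # Coarse static path: process chunks in order with no per-task
        # bookkeeping (stand-in for a static parallel loop).
        return [func(c) for c in chunks]
    # Fine-grained dynamic path: one task per chunk, the runtime balances load.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, chunks))
```

Because the hint is a single scalar, it can be computed once from the task graph (as in the paper's estimator) and passed through unchanged, keeping the adaptive behaviour invisible to end users.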
Limitations & Future Work
- Assumes homogeneous cores – The current model does not explicitly handle heterogeneous architectures (e.g., CPU + GPU) where scheduling costs differ per device.
- Static cost estimation – The per‑dependency scheduling cost is measured once per runtime; variations due to contention or NUMA effects are not modeled.
- Focus on DAG‑structured workloads – Irregular, dynamically evolving graphs (e.g., adaptive mesh refinement) may require extensions to capture runtime graph mutations.
- Future directions proposed include: integrating NUMA‑aware cost models, extending the framework to handle adaptive task creation, and exploring machine‑learning‑augmented granularity predictors for even finer runtime decisions.
Authors
- Sana Taghipour Anvar
- David Kaeli
Paper Information
- arXiv ID: 2602.20561v1
- Categories: cs.DC
- Published: February 24, 2026