[Paper] Improving Slow Transfer Predictions: Generative Methods Compared

Published: December 16, 2025 at 10:55 AM EST
3 min read
Source: arXiv - 2512.14522v1

Overview

Predicting whether a data transfer will be sluggish early in its lifecycle can save huge amounts of time and bandwidth on scientific‑computing networks. This paper tackles the notorious class‑imbalance problem that plagues such predictions—most transfers are fast, while the “slow” cases (the ones we care about) are rare. The authors systematically compare classic oversampling tricks with modern generative models (e.g., CTGAN) to see if synthetic data can boost prediction quality.

Key Contributions

  • Comprehensive benchmark of traditional oversampling (SMOTE, random oversampling) vs. deep generative approaches (CTGAN, Tabular GANs) for the slow‑transfer detection task.
  • Controlled experiments that vary the imbalance ratio in the training set, quantifying how much synthetic data helps (or doesn’t); a minimal sketch of this subsampling setup appears after this list.
  • Empirical finding that, beyond a certain imbalance severity, even sophisticated generators fail to outperform simple stratified sampling.
  • Open‑source pipeline (data preprocessing, augmentation, evaluation) that can be reused for other network‑performance forecasting problems.
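
To make the controlled-imbalance setup concrete, here is a minimal sketch of how the majority ("fast") class could be downsampled to a target minority-to-majority ratio. The column names, label encoding, and file path are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of building imbalance scenarios (1:10, 1:20, 1:50) by
# downsampling the majority ("fast") class. Names and paths are assumptions.
import pandas as pd

def subsample_to_ratio(df: pd.DataFrame, label_col: str, ratio: int,
                       seed: int = 0) -> pd.DataFrame:
    """Keep every minority ("slow" == 1) row and downsample majority rows
    so that minority:majority is roughly 1:ratio."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_major = min(len(majority), len(minority) * ratio)
    majority_sub = majority.sample(n=n_major, random_state=seed)
    # Shuffle so minority rows are not clustered at the top of the frame.
    return pd.concat([minority, majority_sub]).sample(frac=1.0, random_state=seed)

# transfers = pd.read_parquet("transfer_logs.parquet")  # hypothetical path
# scenarios = {r: subsample_to_ratio(transfers, "is_slow", r) for r in (10, 20, 50)}
```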

Methodology

  1. Dataset & Labels – Real‑world transfer logs from a high‑performance computing (HPC) environment were labeled “slow” or “fast” based on a latency threshold. The natural distribution was heavily skewed toward “fast.”
  2. Imbalance Scenarios – The authors artificially subsampled the majority class to create training sets with different minority‑to‑majority ratios (e.g., 1:10, 1:20, 1:50).
  3. Augmentation Techniques
    • Traditional: Random oversampling, SMOTE (Synthetic Minority Over-sampling Technique).
    • Generative: Conditional Tabular GAN (CTGAN) and a vanilla Tabular GAN, trained to generate realistic feature vectors for the minority class.
  4. Model & Evaluation – A lightweight gradient‑boosted decision tree (XGBoost) was trained on each augmented dataset. Performance was measured with precision‑recall AUC, F1‑score, and confusion‑matrix‑derived metrics, focusing on the minority (slow) class.
  5. Statistical Rigor – Each experiment was repeated 10 times with different random seeds; results were aggregated and tested for significance with paired t‑tests (see the sketch after this list).
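
A condensed sketch of steps 3–5 under common tooling assumptions (imbalanced-learn for SMOTE, xgboost for the classifier, scikit-learn metrics) might look like the following; the paper's released pipeline may differ in hyperparameters and structure.

```python
# Hedged sketch of the augment-then-train loop: optionally rebalance the
# minority class, fit XGBoost, and score PR-AUC / F1 on the "slow" class.
from imblearn.over_sampling import SMOTE
from sklearn.metrics import average_precision_score, f1_score
from xgboost import XGBClassifier

def train_and_score(X_train, y_train, X_test, y_test, sampler=None, seed=0):
    if sampler is not None:                      # e.g. SMOTE(random_state=seed)
        X_train, y_train = sampler.fit_resample(X_train, y_train)
    model = XGBClassifier(n_estimators=200, max_depth=6,
                          random_state=seed, eval_metric="logloss")
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]   # probability of "slow"
    preds = (scores >= 0.5).astype(int)
    return {"pr_auc": average_precision_score(y_test, scores),
            "f1": f1_score(y_test, preds)}

# Mirror the 10-seed protocol, then compare augmentation methods pairwise,
# e.g. with scipy.stats.ttest_rel over the per-seed PR-AUC values.
# results = [train_and_score(X_tr, y_tr, X_te, y_te, SMOTE(random_state=s), s)
#            for s in range(10)]
```

CTGAN-based augmentation would slot in at the same point: fit a generator on the minority rows (for example, with the standalone ctgan package) and append its synthetic samples to the training set before fitting the classifier.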

Results & Findings

| Imbalance Ratio | Augmentation | PR‑AUC ↑ vs. Baseline | F1‑Score ↑ vs. Baseline |
| --- | --- | --- | --- |
| 1:10 | Random Oversample | +3.2% | +2.8% |
| 1:10 | SMOTE | +4.1% | +3.5% |
| 1:10 | CTGAN | +4.3% | +3.7% |
| 1:20 | Random Oversample | +2.1% | +1.9% |
| 1:20 | SMOTE | +2.4% | +2.1% |
| 1:20 | CTGAN | +2.5% | +2.2% |
| 1:50 | Any method | ≈ 0% | ≈ 0% |

  • Marginal gains: Generative methods (CTGAN) edge out traditional oversampling by only ~0.2–0.3% in the best case.
  • Diminishing returns: When the minority class becomes extremely scarce (1:50), synthetic data no longer yields measurable improvements.
  • Training cost: CTGAN required ~10× more compute time than SMOTE for comparable gains, raising questions about cost‑benefit.

Practical Implications

  • Network Ops: Teams can adopt simple stratified sampling or SMOTE to improve early‑warning models for slow transfers without the overhead of training GANs.
  • Tooling: The open‑source augmentation pipeline can be plugged into existing monitoring stacks (e.g., Prometheus + custom ML services) to periodically rebalance training data as traffic patterns evolve; a hypothetical retraining sketch follows this list.
  • Resource Allocation: Since the payoff vanishes at extreme imbalance, operators should consider collecting more real slow‑transfer samples (e.g., by deliberately injecting test transfers) rather than relying on synthetic data.
  • Generalization: The findings likely extend to other HPC performance prediction tasks (job runtime, I/O contention) where the event of interest is rare.
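
As one illustration of the tooling point above, a periodic rebalance-and-retrain job might look like the sketch below. Every name here (the fetch/publish hooks, the retraining interval) is hypothetical and not part of the paper's released code.

```python
# Hypothetical periodic rebalance-and-retrain loop for a monitoring stack.
# fetch_logs() and publish_model() are placeholders supplied by the ops side.
import time
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def rebalance_and_retrain(fetch_logs, publish_model, interval_s=24 * 3600):
    """fetch_logs() -> (X, y) of recent labeled transfers;
    publish_model(model) pushes the refreshed model to the prediction service."""
    while True:
        X, y = fetch_logs()
        X_bal, y_bal = SMOTE().fit_resample(X, y)       # rebalance the slow class
        model = XGBClassifier(n_estimators=200, eval_metric="logloss")
        model.fit(X_bal, y_bal)
        publish_model(model)
        time.sleep(interval_s)                           # e.g. retrain daily
```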

Limitations & Future Work

  • Domain specificity: Experiments were confined to a single HPC site; transfer characteristics may differ in cloud or edge environments.
  • Feature set: Only tabular metadata (size, protocol, source/destination) was used; richer time‑series or packet‑level features could change the balance dynamics.
  • Generative diversity: CTGAN struggled to capture subtle correlations in the minority class; future work could explore conditional diffusion models or hybrid oversampling‑GAN pipelines.
  • Real‑time deployment: The study stops at offline evaluation; integrating the augmentation step into a live monitoring pipeline remains an open engineering challenge.

Authors

  • Jacob Taegon Kim
  • Alex Sim
  • Kesheng Wu
  • Jinoh Kim

Paper Information

  • arXiv ID: 2512.14522v1
  • Categories: cs.LG, cs.DC, cs.NI
  • Published: December 16, 2025