[Paper] Improving Slow Transfer Predictions: Generative Methods Compared

Published: December 16, 2025 at 10:55 AM EST
3 min read
Source: arXiv - 2512.14522v1

Overview

Predicting whether a data transfer will be sluggish early in its lifecycle can save huge amounts of time and bandwidth on scientific‑computing networks. This paper tackles the notorious class‑imbalance problem that plagues such predictions—most transfers are fast, while the “slow” cases (the ones we care about) are rare. The authors systematically compare classic oversampling tricks with modern generative models (e.g., CTGAN) to see if synthetic data can boost prediction quality.

Key Contributions

  • Comprehensive benchmark of traditional oversampling (SMOTE, random oversampling) vs. deep generative approaches (CTGAN, Tabular GANs) for the slow‑transfer detection task.
  • Controlled experiments that vary the imbalance ratio in the training set, quantifying how much synthetic data helps (or doesn’t); a minimal sketch of this subsampling setup appears after this list.
  • Empirical finding that, beyond a certain imbalance severity, even sophisticated generators fail to outperform simple stratified sampling.
  • Open‑source pipeline (data preprocessing, augmentation, evaluation) that can be reused for other network‑performance forecasting problems.
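
To make the controlled-imbalance setup concrete, here is a minimal sketch of how the majority ("fast") class could be downsampled to a target minority-to-majority ratio. The column names, label encoding, and file path are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of building imbalance scenarios (1:10, 1:20, 1:50) by
# downsampling the majority ("fast") class. Names and paths are assumptions.
import pandas as pd

def subsample_to_ratio(df: pd.DataFrame, label_col: str, ratio: int,
                       seed: int = 0) -> pd.DataFrame:
    """Keep every minority ("slow" == 1) row and downsample majority rows
    so that minority:majority is roughly 1:ratio."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_major = min(len(majority), len(minority) * ratio)
    majority_sub = majority.sample(n=n_major, random_state=seed)
    # Shuffle so minority rows are not clustered at the top of the frame.
    return pd.concat([minority, majority_sub]).sample(frac=1.0, random_state=seed)

# transfers = pd.read_parquet("transfer_logs.parquet")  # hypothetical path
# scenarios = {r: subsample_to_ratio(transfers, "is_slow", r) for r in (10, 20, 50)}
```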

Methodology

  1. Dataset & Labels – Real‑world transfer logs from a high‑performance computing (HPC) environment were labeled “slow” or “fast” based on a latency threshold. The natural distribution was heavily skewed toward “fast.”
  2. Imbalance Scenarios – The authors artificially subsampled the majority class to create training sets with different minority‑to‑majority ratios (e.g., 1:10, 1:20, 1:50).
  3. Augmentation Techniques
    • Traditional: Random oversampling, SMOTE (Synthetic Minority Over-sampling Technique).
    • Generative: Conditional Tabular GAN (CTGAN) and a vanilla Tabular GAN, trained to generate realistic feature vectors for the minority class.
  4. Model & Evaluation – A lightweight gradient‑boosted decision tree (XGBoost) was trained on each augmented dataset. Performance was measured with precision‑recall AUC, F1‑score, and confusion‑matrix‑derived metrics, focusing on the minority (slow) class.
  5. Statistical Rigor – Each experiment was repeated 10 times with different random seeds; results were aggregated and tested for significance with paired t‑tests (see the sketch after this list).
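
A condensed sketch of steps 3–5 under common tooling assumptions (imbalanced-learn for SMOTE, xgboost for the classifier, scikit-learn metrics) might look like the following; the paper's released pipeline may differ in hyperparameters and structure.

```python
# Hedged sketch of the augment-then-train loop: optionally rebalance the
# minority class, fit XGBoost, and score PR-AUC / F1 on the "slow" class.
from imblearn.over_sampling import SMOTE
from sklearn.metrics import average_precision_score, f1_score
from xgboost import XGBClassifier

def train_and_score(X_train, y_train, X_test, y_test, sampler=None, seed=0):
    if sampler is not None:                      # e.g. SMOTE(random_state=seed)
        X_train, y_train = sampler.fit_resample(X_train, y_train)
    model = XGBClassifier(n_estimators=200, max_depth=6,
                          random_state=seed, eval_metric="logloss")
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]   # probability of "slow"
    preds = (scores >= 0.5).astype(int)
    return {"pr_auc": average_precision_score(y_test, scores),
            "f1": f1_score(y_test, preds)}

# Mirror the 10-seed protocol, then compare augmentation methods pairwise,
# e.g. with scipy.stats.ttest_rel over the per-seed PR-AUC values.
# results = [train_and_score(X_tr, y_tr, X_te, y_te, SMOTE(random_state=s), s)
#            for s in range(10)]
```

CTGAN-based augmentation would slot in at the same point: fit a generator on the minority rows (for example, with the standalone ctgan package) and append its synthetic samples to the training set before fitting the classifier.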

Results & Findings

| Imbalance Ratio | Augmentation | PR‑AUC ↑ vs. Baseline | F1‑Score ↑ vs. Baseline |
| --- | --- | --- | --- |
| 1:10 | Random Oversample | +3.2% | +2.8% |
| 1:10 | SMOTE | +4.1% | +3.5% |
| 1:10 | CTGAN | +4.3% | +3.7% |
| 1:20 | Random Oversample | +2.1% | +1.9% |
| 1:20 | SMOTE | +2.4% | +2.1% |
| 1:20 | CTGAN | +2.5% | +2.2% |
| 1:50 | Any method | ≈ 0% | ≈ 0% |

  • Marginal gains: Generative methods (CTGAN) edge out traditional oversampling by only ~0.2–0.3% in the best case.
  • Diminishing returns: When the minority class becomes extremely scarce (1:50), synthetic data no longer yields measurable improvements.
  • Training cost: CTGAN required ~10× more compute time than SMOTE for comparable gains, raising questions about cost‑benefit.

Practical Implications

  • Network Ops: Teams can adopt simple stratified sampling or SMOTE to improve early‑warning models for slow transfers without the overhead of training GANs.
  • Tooling: The open‑source augmentation pipeline can be plugged into existing monitoring stacks (e.g., Prometheus + custom ML services) to periodically rebalance training data as traffic patterns evolve; a hypothetical retraining sketch follows this list.
  • Resource Allocation: Since the payoff vanishes at extreme imbalance, operators should consider collecting more real slow‑transfer samples (e.g., by deliberately injecting test transfers) rather than relying on synthetic data.
  • Generalization: The findings likely extend to other HPC performance prediction tasks (job runtime, I/O contention) where the event of interest is rare.
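
As one illustration of the tooling point above, a periodic rebalance-and-retrain job might look like the sketch below. Every name here (the fetch/publish hooks, the retraining interval) is hypothetical and not part of the paper's released code.

```python
# Hypothetical periodic rebalance-and-retrain loop for a monitoring stack.
# fetch_logs() and publish_model() are placeholders supplied by the ops side.
import time
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def rebalance_and_retrain(fetch_logs, publish_model, interval_s=24 * 3600):
    """fetch_logs() -> (X, y) of recent labeled transfers;
    publish_model(model) pushes the refreshed model to the prediction service."""
    while True:
        X, y = fetch_logs()
        X_bal, y_bal = SMOTE().fit_resample(X, y)       # rebalance the slow class
        model = XGBClassifier(n_estimators=200, eval_metric="logloss")
        model.fit(X_bal, y_bal)
        publish_model(model)
        time.sleep(interval_s)                           # e.g. retrain daily
```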

Limitations & Future Work

  • Domain specificity: Experiments were confined to a single HPC site; transfer characteristics may differ in cloud or edge environments.
  • Feature set: Only tabular metadata (size, protocol, source/destination) was used; richer time‑series or packet‑level features could change the balance dynamics.
  • Generative diversity: CTGAN struggled to capture subtle correlations in the minority class; future work could explore conditional diffusion models or hybrid oversampling‑GAN pipelines.
  • Real‑time deployment: The study stops at offline evaluation; integrating the augmentation step into a live monitoring pipeline remains an open engineering challenge.

Authors

  • Jacob Taegon Kim
  • Alex Sim
  • Kesheng Wu
  • Jinoh Kim

Paper Information

  • arXiv ID: 2512.14522v1
  • Categories: cs.LG, cs.DC, cs.NI
  • Published: December 16, 2025