[Paper] Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks
Source: arXiv - 2512.21315v1
Overview
The paper investigates a long-standing information-theoretic rule, the Data Processing Inequality (DPI), which says that no amount of preprocessing can increase the information available for a downstream task such as classification. While the DPI holds for the optimal Bayes classifier, modern deep learning pipelines routinely apply "low-level" steps (denoising, compression, feature extraction) before the final classifier. The authors ask: when does such preprocessing actually help real-world models? They combine theory and experiments to show that low-level processing can improve accuracy when training data are limited, noisy, or imbalanced.
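For reference, a standard textbook statement of the DPI (not quoted from the paper): if the label $Y$, the observation $X$, and any processed version $T(X)$ form a Markov chain, then processing cannot add information about $Y$, and the Bayes-optimal error on the processed data can only stay the same or grow.

```latex
% Standard DPI statement (textbook form, not quoted from the paper).
% Markov chain: Y -> X -> T(X), where T is any (possibly stochastic) processing map.
\[
  I\bigl(Y;\, T(X)\bigr) \;\le\; I\bigl(Y;\, X\bigr),
  \qquad
  \varepsilon^{*}\bigl(T(X)\bigr) \;\ge\; \varepsilon^{*}(X),
\]
% where I denotes mutual information and \varepsilon^{*} the Bayes-optimal error.
```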
Key Contributions
- Theoretical proof that for any finite training set, there exists a preprocessing transformation that strictly improves the accuracy of a classifier that asymptotically approaches the Bayes optimal decision rule.
- Analytical characterization of how the gain from preprocessing depends on class separation, dataset size, and class balance.
- Empirical validation on synthetic binary classification tasks that mirror the theoretical setup, confirming the predicted trends.
- Large‑scale experiments with modern deep neural networks (CNNs, Vision Transformers) on benchmark vision datasets, demonstrating that denoising and encoding can boost performance under realistic constraints (small/imbalanced training sets, high noise).
- Practical guidelines for when to invest in low‑level processing versus relying solely on end‑to‑end learning.
Methodology
- Problem formulation – Binary classification with a data distribution $p(x, y)$. The classifier is assumed to be "Bayes-connected": as the number of labeled examples $n$ grows, its decision boundary converges to the Bayes-optimal one.
- Theoretical analysis – Using finite-sample statistical learning bounds, the authors construct a preprocessing map $T(\cdot)$ (e.g., a denoiser or encoder) that reduces the variance of the empirical risk estimator, thereby improving the finite-sample error. They prove that for any finite $n$ there exists such a $T$ that yields a strictly lower misclassification probability.
- Synthetic experiments – They generate 2-D Gaussian mixtures with controllable class overlap, noise level, and class priors. Different preprocessing functions (Gaussian smoothing, PCA compression) are applied before training a logistic regression model that mimics the Bayes-connected classifier; a minimal sketch of this setup appears after this list.
- Deep-learning benchmarks – Standard vision datasets (CIFAR-10, ImageNet subsets) are corrupted with additive Gaussian noise, and training-set size and class balance are systematically varied. The authors compare three pipelines (sketched in code after this list):
  - (a) raw images → deep classifier
  - (b) denoised images → deep classifier
  - (c) encoded (e.g., JPEG-compressed) images → deep classifier
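A minimal sketch of the synthetic setup, assuming scikit-learn and NumPy; the mixture means, noise level, sample sizes, and the 1-D PCA compression are illustrative choices, not the paper's exact configuration:

```python
# Sketch: 2-D Gaussian-mixture binary classification, with and without a
# preprocessing map T (here: PCA compression). All concrete values are
# illustrative assumptions, not the paper's settings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, prior=0.5, noise=1.5):
    """Draw n labeled points from two noisy Gaussian blobs with class prior `prior`."""
    y = (rng.random(n) < prior).astype(int)
    means = np.array([[-1.0, -1.0], [1.0, 1.0]])
    x = means[y] + noise * rng.standard_normal((n, 2))
    return x, y

# Small, noisy training set; large held-out set to estimate the test error.
x_tr, y_tr = sample(50)
x_te, y_te = sample(20_000)

# Pipeline (a): raw features -> logistic regression.
raw = LogisticRegression().fit(x_tr, y_tr)

# Pipeline (b): PCA compression to 1-D as a simple preprocessing map T,
# followed by the same classifier on the compressed features.
pca = PCA(n_components=1).fit(x_tr)
enc = LogisticRegression().fit(pca.transform(x_tr), y_tr)

print("raw accuracy:           ", raw.score(x_te, y_te))
print("PCA-compressed accuracy:", enc.score(pca.transform(x_te), y_te))
```

Sweeping the training-set size, class prior, and noise level in this sketch corresponds to the knobs the paper varies to map out when preprocessing helps.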
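The three deep-learning pipelines can likewise be written as alternative preprocessing transforms placed in front of the same classifier. The sketch below assumes torchvision and Pillow; the noise level, blur kernel, and JPEG quality are illustrative assumptions:

```python
# Sketch: three preprocessing pipelines feeding the same deep classifier.
# Noise sigma, blur kernel, and JPEG quality are illustrative assumptions.
import io
import torch
from PIL import Image
from torchvision import transforms

def add_gaussian_noise(x, sigma=0.5):
    """Additive Gaussian noise on a [0, 1] image tensor."""
    return (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)

def jpeg_roundtrip(img, quality=50):
    """Encode/decode a PIL image with JPEG, acting as a lossy 'encoder' step."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

noise = transforms.Lambda(add_gaussian_noise)

# (a) raw pipeline: corrupt the image, feed it to the classifier as-is.
pipeline_raw = transforms.Compose([transforms.ToTensor(), noise])

# (b) denoised pipeline: corrupt, then apply a simple Gaussian-blur denoiser.
pipeline_denoised = transforms.Compose([
    transforms.ToTensor(), noise,
    transforms.GaussianBlur(kernel_size=5, sigma=1.0),
])

# (c) encoded pipeline: corrupt, then a JPEG round-trip before classification.
pipeline_encoded = transforms.Compose([
    transforms.ToTensor(), noise,
    transforms.ToPILImage(),
    transforms.Lambda(jpeg_roundtrip),
    transforms.ToTensor(),
])
```

Each of these transforms would be passed to the same dataset and training loop (e.g., via a torchvision dataset's `transform` argument), so only the preprocessing stage differs between runs while training-set size and class balance are varied.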
Results & Findings
| Scenario | Raw pipeline accuracy | Gain with preprocessing | Observation |
|---|---|---|---|
| Small training set (≤ 5 k samples) | 68 % | +2–5 % after denoising | Consistent with theory |
| Highly imbalanced (1 : 9) | 61 % | +3 % after class‑aware encoding | Improves minority class recall |
| High noise (σ = 0.5) | 55 % | +7 % after Gaussian denoising | Larger gains when noise dominates |
| Large training set (≥ 100 k) | 84 % | ≈ 0 % (no gain) | DPI effect resurfaces asymptotically |
Key take‑aways
- Finite-sample regime: preprocessing reduces the variance of the empirical risk, giving a measurable boost (see the decomposition sketched after this list).
- Class separation matters: when classes are already well separated, the benefit shrinks.
- Noise level is a driver: stronger corruption amplifies the advantage of denoising.
- As training data approach infinity, the advantage disappears, aligning with the classic DPI statement.
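One way to read these trends is through a generic excess-risk decomposition (a textbook-style view, not necessarily the paper's exact bound): preprocessing may raise the best achievable error (the DPI cost) while shrinking the part of the error that comes from fitting the classifier on finitely many samples.

```latex
% Generic decomposition for a classifier \hat{h}_n trained on n samples over inputs Z,
% where Z = X for the raw pipeline and Z = T(X) after preprocessing:
\[
  R(\hat{h}_n) - R^{*}(X)
  \;=\;
  \underbrace{R(\hat{h}_n) - R^{*}(Z)}_{\text{finite-sample term, shrinks as } n \to \infty}
  \;+\;
  \underbrace{R^{*}(Z) - R^{*}(X)}_{\text{DPI cost, } \ge 0 \text{ when } Z = T(X)}
\]
% Preprocessing helps when the drop in the first term outweighs the second;
% as n grows the first term vanishes and the DPI cost dominates, matching the table above.
```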
Practical Implications
- Data‑starved projects (e.g., medical imaging with limited annotated scans) can profit from a lightweight denoising or compression front‑end before fine‑tuning a deep model.
- Edge-device deployments often operate under bandwidth or storage constraints; applying an encoder (JPEG, WebP) that doubles as a regularizer can improve downstream accuracy with little extra compute, since the encoding is typically already part of the pipeline.
- Imbalanced datasets benefit from class-aware preprocessing (e.g., oversampling after denoising) that equalizes the effective signal-to-noise ratio across classes; a minimal sketch follows this list.
- Pipeline design: Instead of “end‑to‑end everything,” teams should evaluate a modest preprocessing stage when the training regime is constrained, as the cost (CPU/GPU time) is usually negligible compared to the potential accuracy lift.
- Model‑agnostic: The theoretical results hold for any classifier that converges to Bayes optimality, so the insights apply to logistic regression, SVMs, and modern deep nets alike.
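For the imbalanced-data point above, a minimal sketch of "denoise first, then balance" using NumPy and SciPy; the blur-based denoiser and the plain oversampling scheme are illustrative assumptions, not the paper's recipe:

```python
# Sketch: class-aware preprocessing for an imbalanced image set --
# denoise every image first, then oversample the minority class.
# The Gaussian-blur "denoiser" and the resampling scheme are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def denoise_then_oversample(images, labels, sigma=1.0, seed=0):
    """images: (n, H, W) array in [0, 1]; labels: (n,) array of {0, 1}."""
    rng = np.random.default_rng(seed)
    # Low-level step: smooth every image before any class balancing.
    denoised = np.stack([gaussian_filter(img, sigma=sigma) for img in images])
    # Oversample the minority class until both classes have equal counts.
    counts = np.bincount(labels, minlength=2)
    minority = int(np.argmin(counts))
    extra = int(counts.max() - counts.min())
    idx = rng.choice(np.flatnonzero(labels == minority), size=extra, replace=True)
    images_bal = np.concatenate([denoised, denoised[idx]])
    labels_bal = np.concatenate([labels, labels[idx]])
    return images_bal, labels_bal
```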
Limitations & Future Work
- The formal proof assumes a binary classification setting and a classifier that is tightly coupled to the Bayes rule; extending to multi‑class or structured outputs remains open.
- The constructed preprocessing map $T$ is only shown to exist; the paper does not provide a universal recipe for finding the optimal $T$ in arbitrary domains.
- Experiments focus on Gaussian noise and standard image compression; other realistic corruptions (motion blur, sensor artifacts) need separate investigation.
- Future research could explore learned preprocessing (e.g., trainable denoisers) that adapt jointly with the classifier, and assess the trade‑off between additional parameters and the finite‑sample gains demonstrated here.
Authors
- Roy Turgeman
- Tom Tirer
Paper Information
- arXiv ID: 2512.21315v1
- Categories: cs.LG, cs.CV, stat.ML
- Published: December 24, 2025