[Paper] High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments
Source: arXiv - 2512.10312v1
Overview
The paper chronicles a hands‑on exploration of machine‑learning (ML) and deep‑learning (DL) pipelines applied to high‑dimensional data, ranging from numeric benchmarks (the Epsilon dataset) to real‑world text and multimedia corpora. By juxtaposing local (single‑node) and distributed (Apache Spark) environments, the authors expose the trade‑offs developers face when scaling analytics workloads.
Key Contributions
- End‑to‑end benchmark suite covering classic ML (linear models, tree‑based methods) and modern DL (feed‑forward nets) on the high‑dimensional Epsilon dataset.
- Comparative performance analysis of local (CPU/GPU) versus Spark‑based distributed execution for both training and inference.
- Applied case studies:
  - Text classification using the RestMex corpus (Spanish-language reviews).
  - Feature extraction and recommendation modeling on the IMDb movie dataset.
- Practical guide for provisioning a Spark cluster on Linux with Scala, including scripts for data ingestion, model serialization, and job scheduling (an ingestion and serialization sketch follows this list).
- Open‑source artifacts (datasets, notebooks, and Spark jobs) released under a permissive license for reproducibility.
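The authors' scripts are written in Scala and Bash; as a quick illustration of the ingestion and serialization steps they describe, here is a PySpark sketch. The HDFS paths, the "label" column name, and the choice of logistic regression are assumptions for illustration, not the paper's actual code.

```python
# Hypothetical ingestion/serialization job; paths, the "label" column name,
# and the logistic-regression model are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("epsilon-ingest").getOrCreate()

# Ingest a CSV export of the Epsilon dataset.
df = spark.read.csv("hdfs:///data/epsilon.csv", header=True, inferSchema=True)

# Assemble the numeric columns into a single feature vector.
feature_cols = [c for c in df.columns if c != "label"]
train = (VectorAssembler(inputCols=feature_cols, outputCol="features")
         .transform(df)
         .select("features", "label"))

# Train and serialize the model so a separate scoring job can reload it later.
model = LogisticRegression(maxIter=50).fit(train)
model.write().overwrite().save("hdfs:///models/epsilon_lr")

spark.stop()
```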
Methodology
Dataset Preparation
- Epsilon: 400,000 samples × 2,000 features, pre-processed with standard scaling.
- RestMex: scraped restaurant reviews, tokenized, and transformed into TF‑IDF vectors.
- IMDb: extracted plot summaries, cast lists, and ratings; encoded via word embeddings and one‑hot encodings.
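A minimal local sketch of the preprocessing steps above, assuming scikit-learn; the toy array and the two Spanish reviews are placeholders rather than the actual Epsilon and RestMex files.

```python
# Stand-ins for the real datasets; shapes and texts are illustrative only.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Epsilon-style numeric matrix: zero-mean, unit-variance scaling per feature.
rng = np.random.default_rng(0)
X_num = rng.random((1_000, 2_000))    # stand-in for the 400,000 x 2,000 matrix
X_scaled = StandardScaler().fit_transform(X_num)

# RestMex-style reviews: tokenize and build TF-IDF vectors.
reviews = ["La comida estuvo excelente", "Servicio muy lento y mesas sucias"]
X_text = TfidfVectorizer().fit_transform(reviews)

print(X_scaled.shape, X_text.shape)
```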
Model Portfolio
- ML: Logistic regression, SVM, Random Forest, Gradient Boosting.
- DL: Multi‑layer perceptron (MLP) with ReLU activations, a shallow CNN for text, and a hybrid MLP‑embedding model for movies.
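The following Keras sketch shows an MLP of the kind listed above; the layer widths, optimizer, and loss are assumptions, since the summary does not give the authors' exact hyperparameters.

```python
import tensorflow as tf

def build_mlp(n_features: int) -> tf.keras.Model:
    """Feed-forward net with ReLU hidden layers for binary classification."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Epsilon is a binary task with 2,000 input features.
mlp = build_mlp(n_features=2000)
mlp.summary()
```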
Execution Environments
- Local: Python (scikit‑learn, TensorFlow) on a workstation with 32 GB RAM, 8‑core CPU, and optional NVIDIA GPU.
- Distributed: Spark 3.x cluster (3 worker nodes, each 8 vCPU, 32 GB RAM) using Spark MLlib and Spark‑TensorFlow integration via TensorFlowOnSpark.
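A hedged sketch of how the two SparkSessions might be configured; the master URL "spark://spark-master:7077" and the executor settings are assumptions that merely mirror the hardware listed above, not the authors' configuration.

```python
from pyspark.sql import SparkSession

# Local mode: a single JVM using every core of the workstation.
local_spark = (SparkSession.builder
               .master("local[*]")
               .appName("benchmark-local")
               .getOrCreate())
local_spark.stop()

# Standalone cluster mode: 3 workers x 8 cores, with headroom under 32 GB per node.
cluster_spark = (SparkSession.builder
                 .master("spark://spark-master:7077")   # hypothetical master URL
                 .appName("benchmark-distributed")
                 .config("spark.executor.cores", "8")
                 .config("spark.executor.memory", "24g")
                 .config("spark.cores.max", "24")        # 3 workers x 8 cores
                 .getOrCreate())
```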
Evaluation Metrics
- Training time, peak memory usage, and model accuracy/F1‑score.
- Scalability measured by varying Spark executor counts and data partitions.
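One way to record the three metrics for a local run, as a sketch; the synthetic data and the logistic-regression model are placeholders rather than the paper's benchmark code, and resource.getrusage is Unix-only.

```python
import time
import resource
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Placeholder data; the real benchmark would load Epsilon/RestMex/IMDb features.
X, y = make_classification(n_samples=5_000, n_features=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

start = time.perf_counter()
clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
train_time = time.perf_counter() - start

# ru_maxrss is reported in kilobytes on Linux.
peak_rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
pred = clf.predict(X_te)

print(f"train_time={train_time:.2f}s  peak_rss={peak_rss_mb:.0f}MB  "
      f"accuracy={accuracy_score(y_te, pred):.3f}  f1={f1_score(y_te, pred):.3f}")
```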
Reproducibility
- All experiments scripted in Bash/Scala notebooks; Docker images provided for the local stack; Ansible playbooks for cluster setup.
Results & Findings
| Task | Best Local Model | Best Distributed Model | Speed‑up (Distributed vs. Local) |
|---|---|---|---|
| Epsilon classification (accuracy) | Gradient Boosting (0.93) | Spark‑ML Gradient Boosting (0.92) | 4.2× (training) |
| RestMex sentiment (F1) | CNN (0.88) | TensorFlowOnSpark CNN (0.87) | 3.7× |
| IMDb recommendation (RMSE) | MLP (0.71) | Spark‑ML MLP (0.72) | 5.1× |
- Training time dropped dramatically in the distributed setting, especially for the 400,000 × 2,000 Epsilon feature matrix, where Spark's data parallelism shone (a partition-sweep sketch follows this list).
- Model quality remained within 1–2 % of the local baseline, confirming that Spark’s approximations (e.g., histogram‑based splits) do not sacrifice predictive power.
- Memory footprint per node stayed under 70 % of available RAM, demonstrating that the pipeline scales without exhausting resources.
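To make the scalability claim concrete, here is a hedged sketch of a partition sweep of the kind described in the evaluation; the libsvm path, the partition counts, and the model are illustrative assumptions, not the authors' job.

```python
import time
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("scaling-sweep").getOrCreate()
data = spark.read.format("libsvm").load("hdfs:///data/epsilon.libsvm")

# Epsilon ships with -1/+1 labels; Spark ML expects {0, 1}.
data = data.withColumn("label", F.when(F.col("label") < 0, 0.0).otherwise(1.0))

for n_partitions in (8, 16, 32, 64):
    df = data.repartition(n_partitions).cache()
    df.count()                                  # materialize the cache before timing
    start = time.perf_counter()
    LogisticRegression(maxIter=20).fit(df)
    print(f"partitions={n_partitions}  fit_time={time.perf_counter() - start:.1f}s")
    df.unpersist()

spark.stop()
```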
Practical Implications
- For data engineers: The Spark-centric scripts provide a ready-made template for spinning up a fault-tolerant pipeline on any high-dimensional tabular dataset, saving weeks of boilerplate work.
- For ML practitioners: The benchmark shows that you can safely offload heavy training to a modest Spark cluster without losing accuracy, freeing up local GPUs for experimentation or serving.
- For product teams: Real‑world case studies (restaurant sentiment, movie recommendation) illustrate how to integrate text and metadata pipelines into micro‑services that can be deployed on Kubernetes with Spark‑operator.
- Cost‑benefit insight: The authors estimate a 30 % reduction in total compute cost when using spot‑instance Spark workers versus a continuously running GPU workstation for large batch jobs.
Limitations & Future Work
- Dataset size: Experiments stop at ~400,000 rows; scaling to hundreds of millions of records may expose new bottlenecks (shuffle overhead, driver memory).
- Model diversity: Only shallow DL architectures were tested; future work could benchmark transformer‑based models (BERT, GPT) in Spark.
- Latency focus: The study emphasizes batch training speed; real‑time inference latency on Spark Structured Streaming remains unexamined.
- Hardware heterogeneity: All nodes used homogeneous CPU resources; exploring mixed CPU‑GPU clusters could further accelerate DL workloads.
Bottom line: This paper delivers a practical, reproducible roadmap for developers who need to decide whether to keep ML/DL workloads on a single machine or to migrate them to a Spark cluster. The clear performance gains, coupled with minimal loss in model quality, make a compelling case for distributed training in many enterprise AI pipelines.
Authors
- Julian Rodriguez
- Piotr Lopez
- Emiliano Lerma
- Rafael Medrano
- Jacobo Hernandez
Paper Information
- arXiv ID: 2512.10312v1
- Categories: cs.DC, cs.AI
- Published: December 11, 2025