[Paper] High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments
Source: arXiv - 2512.10312v1
Overview
The paper chronicles a hands‑on exploration of machine‑learning (ML) and deep‑learning (DL) pipelines applied to high‑dimensional data, ranging from numeric benchmarks (the Epsilon dataset) to real‑world text and multimedia corpora. By juxtaposing local (single‑node) and distributed (Apache Spark) environments, the authors expose the trade‑offs developers face when scaling analytics workloads.
Key Contributions
- End‑to‑end benchmark suite covering classic ML (linear models, tree‑based methods) and modern DL (feed‑forward nets) on the high‑dimensional Epsilon dataset.
- Comparative performance analysis of local (CPU/GPU) versus Spark‑based distributed execution for both training and inference.
- Applied case studies:
  - Text classification using the RestMex corpus (Spanish-language reviews).
  - Feature extraction and recommendation modeling on the IMDb movie dataset.
- Practical guide for provisioning a Spark cluster on Linux with Scala, including scripts for data ingestion, model serialization, and job scheduling (an ingestion and serialization sketch follows this list).
- Open‑source artifacts (datasets, notebooks, and Spark jobs) released under a permissive license for reproducibility.
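The authors' scripts are written in Scala and Bash; as a quick illustration of the ingestion and serialization steps they describe, here is a PySpark sketch. The HDFS paths, the "label" column name, and the choice of logistic regression are assumptions for illustration, not the paper's actual code.

```python
# Hypothetical ingestion/serialization job; paths, the "label" column name,
# and the logistic-regression model are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("epsilon-ingest").getOrCreate()

# Ingest a CSV export of the Epsilon dataset.
df = spark.read.csv("hdfs:///data/epsilon.csv", header=True, inferSchema=True)

# Assemble the numeric columns into a single feature vector.
feature_cols = [c for c in df.columns if c != "label"]
train = (VectorAssembler(inputCols=feature_cols, outputCol="features")
         .transform(df)
         .select("features", "label"))

# Train and serialize the model so a separate scoring job can reload it later.
model = LogisticRegression(maxIter=50).fit(train)
model.write().overwrite().save("hdfs:///models/epsilon_lr")

spark.stop()
```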
Methodology
Dataset Preparation
- Epsilon: 400,000 samples × 2,000 features, pre-processed with standard scaling.
- RestMex: scraped restaurant reviews, tokenized, and transformed into TF‑IDF vectors.
- IMDb: extracted plot summaries, cast lists, and ratings; encoded via word embeddings and one‑hot encodings.
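A minimal local sketch of the preprocessing steps above, assuming scikit-learn; the toy array and the two Spanish reviews are placeholders rather than the actual Epsilon and RestMex files.

```python
# Stand-ins for the real datasets; shapes and texts are illustrative only.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Epsilon-style numeric matrix: zero-mean, unit-variance scaling per feature.
rng = np.random.default_rng(0)
X_num = rng.random((1_000, 2_000))    # stand-in for the 400,000 x 2,000 matrix
X_scaled = StandardScaler().fit_transform(X_num)

# RestMex-style reviews: tokenize and build TF-IDF vectors.
reviews = ["La comida estuvo excelente", "Servicio muy lento y mesas sucias"]
X_text = TfidfVectorizer().fit_transform(reviews)

print(X_scaled.shape, X_text.shape)
```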
Model Portfolio
- ML: Logistic regression, SVM, Random Forest, Gradient Boosting.
- DL: Multi‑layer perceptron (MLP) with ReLU activations, a shallow CNN for text, and a hybrid MLP‑embedding model for movies.
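The following Keras sketch shows an MLP of the kind listed above; the layer widths, optimizer, and loss are assumptions, since the summary does not give the authors' exact hyperparameters.

```python
import tensorflow as tf

def build_mlp(n_features: int) -> tf.keras.Model:
    """Feed-forward net with ReLU hidden layers for binary classification."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Epsilon is a binary task with 2,000 input features.
mlp = build_mlp(n_features=2000)
mlp.summary()
```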
Execution Environments
- Local: Python (scikit‑learn, TensorFlow) on a workstation with 32 GB RAM, 8‑core CPU, and optional NVIDIA GPU.
- Distributed: Spark 3.x cluster (3 worker nodes, each 8 vCPU, 32 GB RAM) using Spark MLlib and Spark‑TensorFlow integration via TensorFlowOnSpark.
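A hedged sketch of how the two SparkSessions might be configured; the master URL "spark://spark-master:7077" and the executor settings are assumptions that merely mirror the hardware listed above, not the authors' configuration.

```python
from pyspark.sql import SparkSession

# Local mode: a single JVM using every core of the workstation.
local_spark = (SparkSession.builder
               .master("local[*]")
               .appName("benchmark-local")
               .getOrCreate())
local_spark.stop()

# Standalone cluster mode: 3 workers x 8 cores, with headroom under 32 GB per node.
cluster_spark = (SparkSession.builder
                 .master("spark://spark-master:7077")   # hypothetical master URL
                 .appName("benchmark-distributed")
                 .config("spark.executor.cores", "8")
                 .config("spark.executor.memory", "24g")
                 .config("spark.cores.max", "24")        # 3 workers x 8 cores
                 .getOrCreate())
```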
Evaluation Metrics
- Training time, peak memory usage, and model accuracy/F1‑score.
- Scalability measured by varying Spark executor counts and data partitions.
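One way to record the three metrics for a local run, as a sketch; the synthetic data and the logistic-regression model are placeholders rather than the paper's benchmark code, and resource.getrusage is Unix-only.

```python
import time
import resource
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Placeholder data; the real benchmark would load Epsilon/RestMex/IMDb features.
X, y = make_classification(n_samples=5_000, n_features=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

start = time.perf_counter()
clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
train_time = time.perf_counter() - start

# ru_maxrss is reported in kilobytes on Linux.
peak_rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
pred = clf.predict(X_te)

print(f"train_time={train_time:.2f}s  peak_rss={peak_rss_mb:.0f}MB  "
      f"accuracy={accuracy_score(y_te, pred):.3f}  f1={f1_score(y_te, pred):.3f}")
```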
Reproducibility
- All experiments scripted in Bash/Scala notebooks; Docker images provided for the local stack; Ansible playbooks for cluster setup.
Results & Findings
| Task | Best Local Model | Best Distributed Model | Speed‑up (Distributed vs. Local) |
|---|---|---|---|
| Epsilon classification (accuracy) | Gradient Boosting (0.93) | Spark‑ML Gradient Boosting (0.92) | 4.2× (training) |
| RestMex sentiment (F1) | CNN (0.88) | TensorFlowOnSpark CNN (0.87) | 3.7× |
| IMDb recommendation (RMSE) | MLP (0.71) | Spark‑ML MLP (0.72) | 5.1× |
- Training time dropped dramatically in the distributed setting, especially for the 400,000 × 2,000 Epsilon feature matrix, where Spark's data parallelism shone (a partition-sweep sketch follows this list).
- Model quality remained within 1–2 % of the local baseline, confirming that Spark’s approximations (e.g., histogram‑based splits) do not sacrifice predictive power.
- Memory footprint per node stayed under 70 % of available RAM, demonstrating that the pipeline scales without exhausting resources.
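To make the scalability claim concrete, here is a hedged sketch of a partition sweep of the kind described in the evaluation; the libsvm path, the partition counts, and the model are illustrative assumptions, not the authors' job.

```python
import time
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("scaling-sweep").getOrCreate()
data = spark.read.format("libsvm").load("hdfs:///data/epsilon.libsvm")

# Epsilon ships with -1/+1 labels; Spark ML expects {0, 1}.
data = data.withColumn("label", F.when(F.col("label") < 0, 0.0).otherwise(1.0))

for n_partitions in (8, 16, 32, 64):
    df = data.repartition(n_partitions).cache()
    df.count()                                  # materialize the cache before timing
    start = time.perf_counter()
    LogisticRegression(maxIter=20).fit(df)
    print(f"partitions={n_partitions}  fit_time={time.perf_counter() - start:.1f}s")
    df.unpersist()

spark.stop()
```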
Practical Implications
- For data engineers: The Spark-centric scripts provide a ready-made template for spinning up a fault-tolerant pipeline on any high-dimensional tabular dataset, saving weeks of boilerplate work.
- For ML practitioners: The benchmark shows that you can safely offload heavy training to a modest Spark cluster without losing accuracy, freeing up local GPUs for experimentation or serving.
- For product teams: Real‑world case studies (restaurant sentiment, movie recommendation) illustrate how to integrate text and metadata pipelines into micro‑services that can be deployed on Kubernetes with Spark‑operator.
- Cost‑benefit insight: The authors estimate a 30 % reduction in total compute cost when using spot‑instance Spark workers versus a continuously running GPU workstation for large batch jobs.
Limitations & Future Work
- Dataset size: Experiments stop at ~400,000 rows; scaling to hundreds of millions of records may expose new bottlenecks (shuffle overhead, driver memory).
- Model diversity: Only shallow DL architectures were tested; future work could benchmark transformer‑based models (BERT, GPT) in Spark.
- Latency focus: The study emphasizes batch training speed; real‑time inference latency on Spark Structured Streaming remains unexamined.
- Hardware heterogeneity: All nodes used homogeneous CPU resources; exploring mixed CPU‑GPU clusters could further accelerate DL workloads.
Bottom line: This paper delivers a practical, reproducible roadmap for developers who need to decide whether to keep ML/DL workloads on a single machine or to migrate them to a Spark cluster. The clear performance gains, coupled with minimal loss in model quality, make a compelling case for distributed training in many enterprise AI pipelines.
Authors
- Julian Rodriguez
- Piotr Lopez
- Emiliano Lerma
- Rafael Medrano
- Jacobo Hernandez
Paper Information
- arXiv ID: 2512.10312v1
- Categories: cs.DC, cs.AI
- Published: December 11, 2025