[Paper] BigSUMO: A Scalable Framework for Big Data Traffic Analytics and Parallel Simulation
Source: arXiv - 2601.02286v1
Overview
The authors introduce BigSUMO, an open‑source, end‑to‑end pipeline that turns massive streams of traffic sensor data into actionable insights and fast, parallel microsimulations. By coupling high‑resolution loop‑detector and signal‑state feeds with sparse probe trajectories, the framework lets city planners run hundreds of “what‑if” scenarios in a fraction of the time traditional tools require.
Key Contributions
- Scalable data ingestion: Handles terabytes of raw loop‑detector, signal, and probe data using distributed processing (e.g., Apache Spark).
- Modular analytics stack: Separate stages for descriptive analytics, interruption/outlier detection, and prescriptive simulation, allowing plug‑and‑play of custom algorithms.
- Parallel SUMO integration: Extends the popular SUMO microsimulator with a parallel execution layer, enabling simultaneous evaluation of many traffic‑management scenarios.
- Open‑source implementation: All components are released under permissive licenses, facilitating reproducibility and community extensions.
- Cost‑effective deployment: Built on commodity hardware and cloud‑native services, making large‑scale traffic analytics accessible to municipalities with limited budgets.
Methodology
- Data Collection & Pre‑processing
  - Loop detectors (vehicle counts, speeds) and traffic‑signal states are streamed into a distributed file system.
  - Sparse probe data (e.g., GPS traces) are aligned temporally and spatially to fill gaps in the fixed‑sensor network.
- Descriptive Analytics
  - Basic statistics (average flow, occupancy) are computed per link and per time‑slice using Spark DataFrames.
  - Visualization dashboards surface congestion hotspots and temporal patterns.
- Interruption / Outlier Detection
  - A configurable anomaly detector (e.g., Isolation Forest, statistical z‑score) flags sensor failures, incidents, or abnormal traffic patterns.
  - Detected interruptions are either corrected (imputation) or fed as incident inputs to the simulation stage.
- Prescriptive Analytics via Parallel SUMO
  - The cleaned dataset is transformed into SUMO network files (edges, nodes, traffic demand).
  - A master controller spawns multiple SUMO instances across a compute cluster, each evaluating a distinct “what‑if” policy (e.g., signal timing changes, lane closures).
  - Results (travel time, emissions, queue lengths) are aggregated and ranked automatically.
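The fan‑out and ranking step described above can be pictured with a small sketch. This is not the BigSUMO controller: `Scenario`, `run_scenario`, and `evaluate_and_rank` are hypothetical names, and the delay formula is a toy stand‑in for a simulator run so that the example stays self‑contained. A thread pool is used on the assumption that each worker would mostly wait on an external SUMO process.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One hypothetical what-if policy (fields are illustrative only)."""
    name: str
    signal_offset_s: int


def run_scenario(scenario: Scenario) -> dict:
    """Stand-in for a single simulator run.

    A real worker would launch a SUMO process here and parse its output;
    this toy formula only keeps the example deterministic.
    """
    mean_delay_s = 40.0 + abs(scenario.signal_offset_s - 20) * 0.5
    return {"scenario": scenario.name, "mean_delay_s": mean_delay_s}


def evaluate_and_rank(scenarios, max_workers=4):
    """Run all scenarios concurrently, then rank by mean delay (lowest first)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_scenario, scenarios))
    return sorted(results, key=lambda r: r["mean_delay_s"])


scenarios = [Scenario(f"offset_{o}s", o) for o in (0, 10, 20, 30)]
ranking = evaluate_and_rank(scenarios)
for row in ranking:
    print(row["scenario"], row["mean_delay_s"])
```

Each scenario is independent of the others, which is why this kind of workload parallelizes without coordination between workers; only the final sort needs all results in one place.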
The pipeline is orchestrated with a lightweight workflow engine (e.g., Apache Airflow), ensuring reproducibility and easy scaling.
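As an illustration of what one stage in such a pipeline might do, here is a minimal sketch of the interruption detection step using the statistical z‑score option mentioned in the methodology. The threshold of 2.0, the median‑based imputation, the function names, and the sample counts are all assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, median, pstdev


def flag_outliers(counts, threshold=2.0):
    """Return one boolean per reading: True where |z-score| exceeds threshold."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:  # constant series: nothing can be flagged
        return [False] * len(counts)
    return [abs(c - mu) / sigma > threshold for c in counts]


def impute(counts, flags):
    """Replace flagged readings with the median of the unflagged ones."""
    good = [c for c, bad in zip(counts, flags) if not bad]
    fill = median(good)
    return [fill if bad else c for c, bad in zip(counts, flags)]


# Hypothetical 5-minute vehicle counts for one detector; the 0 mimics a dropout.
counts = [42, 45, 44, 43, 0, 46, 44, 45]
flags = flag_outliers(counts)
cleaned = impute(counts, flags)
print(flags)
print(cleaned)
```

A global z‑score is sensitive to the outliers it is trying to find, since they inflate the standard deviation. On short windows like this one a lower threshold or a rolling, robust statistic is usually needed, which is one reason a configurable detector is useful.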
Results & Findings
- Throughput: Processed a full month of city‑wide loop‑detector data (≈ 2 TB) in under 30 minutes on a 16‑node Spark cluster.
- Simulation Speed‑up: Parallel SUMO cut total simulation time by a factor of 12 relative to a single‑node run, making it possible to evaluate more than 500 policy scenarios within an hour.
- Accuracy: Validation against ground‑truth travel‑time surveys showed a mean absolute error of < 5 % when the cleaned data were fed into SUMO, indicating that the interruption detection step effectively mitigates sensor noise.
- Case Study: Optimizing signal offsets on a congested arterial reduced average vehicle delay by 18 % and cut estimated CO₂ emissions by 7 % in the simulated environment.
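To put the throughput and speed‑up figures in perspective, the following back‑of‑envelope calculation derives rates from the numbers quoted above. It assumes decimal units (1 TB = 10^12 bytes), an even load across the 16 nodes, and that the 12× factor applies to the same 500‑scenario batch. Because the source gives bounds ("under 30 minutes", "> 500"), the results are rough estimates rather than reported measurements.

```python
# Figures quoted in the results above; unit conventions are assumptions.
DATA_BYTES = 2e12          # about 2 TB of loop-detector data
INGEST_SECONDS = 30 * 60   # processed in under 30 minutes
SPARK_NODES = 16
SPEEDUP = 12               # parallel vs. single-node simulation
SCENARIOS = 500            # evaluated within an hour
BATCH_SECONDS = 60 * 60

# Ingestion: aggregate and per-node processing rates.
aggregate_gb_per_s = DATA_BYTES / INGEST_SECONDS / 1e9
per_node_mb_per_s = DATA_BYTES / INGEST_SECONDS / SPARK_NODES / 1e6

# Simulation: wall-clock budget per scenario and the implied serial runtime.
wall_seconds_per_scenario = BATCH_SECONDS / SCENARIOS
serial_equivalent_hours = SPEEDUP * BATCH_SECONDS / 3600

print(f"aggregate ingest rate: {aggregate_gb_per_s:.2f} GB/s")
print(f"per-node ingest rate:  {per_node_mb_per_s:.1f} MB/s")
print(f"wall time / scenario:  {wall_seconds_per_scenario:.1f} s")
print(f"single-node estimate:  {serial_equivalent_hours:.0f} h")
```

Under these assumptions, a batch that fits in an hour on the cluster would occupy a single node for roughly half a day.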
Practical Implications
- Rapid Policy Testing: Transportation agencies can now iterate over dozens of signal‑timing plans or lane‑reconfiguration ideas in near‑real time, shortening the decision cycle from weeks to days.
- Incident Management: Automated detection of sensor outages or traffic incidents feeds directly into simulation, allowing planners to assess mitigation strategies (e.g., dynamic rerouting) before deployment.
- Cost Savings: By leveraging open‑source tools and commodity clusters, municipalities avoid expensive proprietary traffic‑analysis suites while still handling city‑scale data volumes.
- Smart‑City Integration: The modular design makes it straightforward to plug in emerging data sources (connected‑vehicle streams, IoT edge sensors) or machine‑learning models for demand forecasting, paving the way for fully adaptive traffic‑control systems.
Developers can adopt the framework as a library, extend the detection modules, or embed the simulation controller into existing traffic‑management dashboards.
Limitations & Future Work
- Data Quality Dependency: While the interruption detector mitigates many sensor errors, extremely sparse probe coverage can still limit demand estimation accuracy.
- Simulation Fidelity: SUMO abstracts driver behavior; incorporating more sophisticated car‑following models or real‑time driver‑behavior learning could improve realism.
- Scalability Ceiling: The current parallelization strategy scales well up to a few dozen nodes; beyond that, network I/O and SUMO’s internal synchronization become bottlenecks.
- Future Directions: Planned work includes integrating reinforcement‑learning‑based signal controllers, exploring GPU‑accelerated microsimulation, and offering a cloud‑native SaaS deployment for municipalities that lack on‑premise compute resources.
Authors
- Rahul Sengupta
- Nooshin Yousefzadeh
- Manav Sanghvi
- Yash Ranjan
- Anand Rangarajan
- Sanjay Ranka
- Yashaswi Karnati
- Jeremy Dilmore
- Tushar Patel
- Ryan Casburn
Paper Information
- arXiv ID: 2601.02286v1
- Categories: cs.DC
- Published: January 5, 2026