[Paper] 123D: Unifying Multi-Modal Autonomous Driving Data at Scale
Source: arXiv - 2605.08084v1
Overview
The paper introduces 123D, an open‑source framework that brings together eight large‑scale real‑world driving datasets and a configurable synthetic suite under a single Python API. By normalising the very different sensor streams (cameras, LiDAR, ego‑poses, HD maps, etc.) and annotation conventions, 123D makes it practical for developers to train, evaluate, and transfer models across datasets that previously could not be mixed.
Key Contributions
- Unified data model: Represents every sensor modality as an independent, timestamped event stream, eliminating the need for a common frame rate or rigid synchronization.
- Cross‑dataset API: A single Python interface (`py123d`) that can load and query eight large‑scale real‑world driving datasets (≈3,300 h, 90,000 km) plus a configurable synthetic suite.
- Annotation harmonisation: Systematic mapping of disparate labeling schemes (e.g., object classes, traffic‑light states) into a common taxonomy, enabling fair cross‑dataset training and benchmarking.
- Tooling for analysis & visualisation: Built‑in utilities for pose/calibration sanity checks, statistical reports, and interactive 3‑D visualisers.
- Demonstrations of downstream impact:
  - Cross‑dataset 3‑D object detection transfer learning that improves performance on low‑resource datasets.
  - Reinforcement‑learning‑based planning experiments that leverage the unified data to train policies with richer scenario diversity.
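The annotation harmonisation listed above amounts to a per‑dataset lookup from source labels to a shared taxonomy. The sketch below is only an illustration of that idea: the dataset names, label strings, and the `harmonise` helper are assumptions made for this example and are not taken from 123D.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping tables: one per source dataset, source label -> unified label.
# None of these names come from 123D; they only illustrate the mechanism.
LABEL_MAPS: Dict[str, Dict[str, str]] = {
    "dataset_a": {
        "Car": "vehicle.car",
        "Van": "vehicle.car",
        "Truck": "vehicle.truck",
        "Pedestrian": "pedestrian",
    },
    "dataset_b": {
        "car": "vehicle.car",
        "bus": "vehicle.bus",
        "person": "pedestrian",
    },
}


def to_unified(dataset: str, label: str) -> Optional[str]:
    """Return the unified label, or None if the source label has no mapping."""
    return LABEL_MAPS[dataset].get(label)


def harmonise(dataset: str, labels: List[str]) -> Tuple[List[str], List[str]]:
    """Split labels into (mapped unified labels, unmapped source labels)."""
    mapped: List[str] = []
    unmapped: List[str] = []
    for label in labels:
        unified = to_unified(dataset, label)
        if unified is None:
            unmapped.append(label)  # keep track so gaps in the mapping are visible
        else:
            mapped.append(unified)
    return mapped, unmapped


mapped, unmapped = harmonise("dataset_a", ["Car", "Van", "Tram"])
```

Returning the unmapped labels separately, rather than silently dropping them, makes it easier to spot classes that a new dataset adapter has not covered yet.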
Methodology
- Event‑stream abstraction – Each sensor (e.g., front‑camera, 64‑beam LiDAR) is stored as a series of `(timestamp, payload)` pairs. No global clock is imposed; developers can request synchronous samples (e.g., nearest‑neighbor interpolation) or work asynchronously (e.g., process LiDAR at 10 Hz while cameras run at 30 Hz).
- Dataset adapters – For every supported source (e.g., Waymo Open, nuScenes, Argoverse), a thin adapter parses the original files and populates the unified event streams. The adapters also translate original calibration matrices and ego‑poses into a common coordinate frame.
- Annotation unification – A hierarchical class taxonomy (vehicle → car, truck, bus; pedestrian → person, cyclist, etc.) is defined. The adapters map source‑specific labels to this taxonomy and normalise attributes such as bounding‑box orientation, velocity, and traffic‑light state.
- Quality assessment pipeline – The authors run pose‑error checks (e.g., loop‑closure consistency), calibration sanity tests (re‑projecting LiDAR points onto images), and statistical audits (class distribution, sensor coverage). Results are stored as metadata for downstream consumers.
- API design – `py123d` exposes high‑level functions like `load_scene()`, `get_sensor_data(sensor_name, timestamp)`, and `query_annotations(filter)`. The API is deliberately framework‑agnostic, so data can be fed into PyTorch, TensorFlow, JAX, or RL libraries without extra glue code.
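To make the event‑stream idea from the list above concrete, here is a minimal sketch of independent per‑sensor streams with a nearest‑timestamp lookup. It does not use `py123d`; the `EventStream` class, its methods, and the integer‑microsecond timestamps are assumptions chosen for this illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class EventStream:
    """Timestamped events for one sensor, kept sorted by time (hypothetical class)."""
    timestamps: List[int] = field(default_factory=list)  # microseconds
    payloads: List[Any] = field(default_factory=list)

    def append(self, timestamp: int, payload: Any) -> None:
        # This sketch assumes events arrive in non-decreasing time order.
        if self.timestamps and timestamp < self.timestamps[-1]:
            raise ValueError("events must be appended in time order")
        self.timestamps.append(timestamp)
        self.payloads.append(payload)

    def nearest(self, query: int, tolerance: int) -> Optional[Tuple[int, Any]]:
        """Return the event closest to `query`, or None if none is within `tolerance`."""
        if not self.timestamps:
            return None
        i = bisect_left(self.timestamps, query)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.timestamps)]
        best = min(candidates, key=lambda j: abs(self.timestamps[j] - query))
        if abs(self.timestamps[best] - query) > tolerance:
            return None
        return self.timestamps[best], self.payloads[best]


# A 30 Hz camera and a 10 Hz LiDAR, each with its own clock ticks.
camera = EventStream()
lidar = EventStream()
for k in range(30):
    camera.append(k * 33_333, f"image_{k}")
for k in range(10):
    lidar.append(k * 100_000, f"sweep_{k}")

# For each LiDAR sweep, pick the camera frame closest in time (within 20 ms).
pairs = [
    (sweep, camera.nearest(t, tolerance=20_000))
    for t, sweep in zip(lidar.timestamps, lidar.payloads)
]
```

Because each stream keeps its own timestamps, the same lookup works whichever sensor is chosen as the reference, and a `None` result makes missing or dropped frames explicit instead of pairing data that is too far apart in time.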
Results & Findings
| Study | Insight |
|---|---|
| Annotation statistics | Synthetic data exhibits a more balanced class distribution, while real datasets are heavily skewed toward cars and road‑side objects. |
| Pose & calibration accuracy | Waymo and nuScenes show sub‑centimeter ego‑pose drift, whereas older datasets (e.g., KITTI) have noticeable yaw errors that can degrade 3‑D detection if left uncorrected. |
| Cross‑dataset 3‑D detection | Pre‑training on the unified pool (all real + synthetic) and fine‑tuning on a target dataset improves mAP by 5–9 % over training on the target alone, especially for rare classes like bicycles and traffic signs. |
| RL planning transfer | Policies trained with scenarios drawn from multiple datasets learn more robust obstacle‑avoidance behaviours, reducing collision rates in simulation by ≈12 % compared to single‑dataset training. |
Practical Implications
- Faster prototyping – Teams no longer need to write bespoke parsers for each dataset; a few lines of code bring any supported data source into their training pipeline.
- Data‑centric AI – By making it trivial to mix real and synthetic data, developers can augment scarce edge‑case scenarios (e.g., adverse weather, rare traffic‑light configurations) without manual data collection.
- Benchmark standardisation – Researchers can now report a single “cross‑dataset” metric, encouraging models that generalise beyond a single benchmark and reducing the “benchmark‑overfitting” problem.
- Calibration sanity checks – The built‑in quality tools help engineers catch sensor‑misalignment bugs early, saving costly re‑recording efforts.
- Open‑source ecosystem – The GitHub repo includes Docker images, Jupyter notebooks, and integration hooks for popular ML frameworks, lowering the barrier for startups and academic labs to adopt a multi‑modal data strategy.
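The calibration checks mentioned above rest on re‑projecting LiDAR points into a camera image. The following sketch shows only that projection step with a plain pinhole model and no lens distortion; it is not the 123D implementation, and the function names, parameters, and example values are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def lidar_to_camera(point: Point3,
                    rotation: Sequence[Sequence[float]],
                    translation: Point3) -> Point3:
    """Apply the extrinsic transform p_cam = R * p_lidar + t."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def project(point_cam: Point3, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Pixel]:
    """Pinhole projection; returns None for points at or behind the camera plane."""
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def reproject(points: Sequence[Point3],
              rotation: Sequence[Sequence[float]],
              translation: Point3,
              fx: float, fy: float, cx: float, cy: float,
              width: int, height: int) -> List[Pixel]:
    """Return pixel coordinates of the points that land inside the image."""
    pixels: List[Pixel] = []
    for p in points:
        pixel = project(lidar_to_camera(p, rotation, translation), fx, fy, cx, cy)
        if pixel is not None and 0.0 <= pixel[0] < width and 0.0 <= pixel[1] < height:
            pixels.append(pixel)
    return pixels


# Toy example: identity extrinsics, so points are already in the camera frame
# (z forward, x right, y down). All values are made up for the sketch.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -1.0), (10.0, 0.0, 10.0)]
pixels = reproject(points, IDENTITY, (0.0, 0.0, 0.0),
                   fx=100.0, fy=100.0, cx=50.0, cy=50.0, width=100, height=100)
```

In practice the resulting pixels would be overlaid on the image: with a good calibration, depth discontinuities in the point cloud line up with object edges, while a misaligned sensor shows a visible offset.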
Limitations & Future Work
- Coverage – While eight major real‑world datasets are supported, many emerging collections (e.g., regional fleets, proprietary OEM data) still require custom adapters.
- Temporal alignment – The event‑stream model assumes timestamps are reliable; datasets with poorly synchronised clocks may need additional post‑processing.
- Synthetic realism gap – Although the synthetic suite is configurable, bridging the domain gap to real sensor noise remains an open challenge.
- Scalability – Storing all modalities as independent streams can increase storage overhead; future work will explore on‑the‑fly compression and cloud‑native streaming.
- Extending to V2X – The authors plan to incorporate vehicle‑to‑everything communication data (e.g., DSRC, C‑V2X) to enable cooperative perception research.
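Since the event‑stream model relies on trustworthy timestamps, a simple audit can flag streams that may need the post‑processing mentioned under temporal alignment. The sketch below is one possible heuristic, not a procedure described in the paper; the `audit_timestamps` function and its `gap_factor` threshold are assumptions.

```python
from statistics import median
from typing import List, Sequence, Tuple


def audit_timestamps(timestamps: Sequence[int],
                     gap_factor: float = 1.5) -> Tuple[List[int], List[int]]:
    """Return (indices of out-of-order events, indices that follow an unusually large gap).

    A gap counts as unusually large when it exceeds `gap_factor` times the
    median positive interval of the stream.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Event i + 1 is out of order if it is not strictly later than event i.
    out_of_order = [i + 1 for i, d in enumerate(deltas) if d <= 0]
    positive = [d for d in deltas if d > 0]
    if not positive:
        return out_of_order, []
    typical = median(positive)
    gaps = [i + 1 for i, d in enumerate(deltas) if d > gap_factor * typical]
    return out_of_order, gaps


# Toy stream in microseconds: one dropped stretch and one timestamp that jumps back.
out_of_order, gaps = audit_timestamps([0, 100, 200, 500, 450, 550])
```

Using the median interval as the reference keeps the check independent of the sensor's nominal rate, so the same function can be applied to a 10 Hz LiDAR and a 30 Hz camera.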
The 123D framework promises to turn the fragmented world of autonomous‑driving datasets into a cohesive playground for developers, accelerating research and bringing more robust, data‑driven solutions to the road.
Authors
- Daniel Dauner
- Valentin Charraut
- Bastian Berle
- Tianyu Li
- Long Nguyen
- Jiabao Wang
- Changhui Jing
- Maximilian Igl
- Holger Caesar
- Boris Ivanovic
- Yiyi Liao
- Andreas Geiger
- Kashyap Chitta
Paper Information
- arXiv ID: 2605.08084v1
- Categories: cs.RO, cs.CV
- Published: May 8, 2026