[Paper] 123D: Unifying Multi-Modal Autonomous Driving Data at Scale

Published: May 8, 2026 at 01:59 PM EDT

Source: arXiv - 2605.08084v1

Overview

The paper introduces 123D, an open‑source framework that brings eight large‑scale real‑world autonomous‑driving datasets and a configurable synthetic suite under a single Python API. By normalising the differing sensor streams (cameras, LiDAR, ego‑poses, HD maps, etc.) and annotation conventions, 123D makes it practical for developers to train, evaluate, and transfer models across datasets that previously could not easily be mixed.

Key Contributions

Unified data model: Represents every sensor modality as an independent, timestamped event stream, eliminating the need for a common frame rate or rigid synchronisation.
Cross‑dataset API: A single Python interface (py123d) that can load and query eight large‑scale real‑world driving datasets (≈3,300 h, 90,000 km) plus a configurable synthetic suite.
Annotation harmonisation: Systematic mapping of disparate labeling schemes (e.g., object classes, traffic‑light states) into a common taxonomy, enabling fair cross‑dataset training and benchmarking.
Tooling for analysis & visualisation: Built‑in utilities for pose/calibration sanity checks, statistical reports, and interactive 3‑D visualisers.
Demonstrations of downstream impact:
1. Cross‑dataset 3‑D object detection transfer learning that improves performance on low‑resource datasets.
2. Reinforcement‑learning‑based planning experiments that leverage the unified data to train policies with richer scenario diversity.
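
To make the annotation harmonisation contribution listed above more concrete, here is a minimal sketch of a label mapping. The dataset names, source labels, and unified class names are illustrative assumptions rather than the taxonomy shipped with 123D, and harmonise is a hypothetical helper, not a py123d function.

```python
from typing import Dict, Optional, Tuple

# Hypothetical unified taxonomy: leaf class -> parent class.
UNIFIED_PARENT: Dict[str, str] = {
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
    "person": "pedestrian",
}

# Hypothetical per-dataset tables: source label -> unified leaf class.
SOURCE_TO_UNIFIED: Dict[str, Dict[str, str]] = {
    "dataset_a": {"Car": "car", "Van": "car", "Truck": "truck", "Pedestrian": "person"},
    "dataset_b": {"vehicle.car": "car", "vehicle.bus": "bus", "human.adult": "person"},
}


def harmonise(dataset: str, label: str) -> Optional[Tuple[str, str]]:
    """Return (parent, leaf) in the unified taxonomy, or None if the label is unmapped."""
    leaf = SOURCE_TO_UNIFIED.get(dataset, {}).get(label)
    if leaf is None:
        return None
    return UNIFIED_PARENT[leaf], leaf


print(harmonise("dataset_a", "Van"))
print(harmonise("dataset_b", "human.adult"))
```

Returning None for unmapped labels, rather than guessing a class, keeps gaps in the mapping visible so they can be audited before training.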

Methodology

Event‑stream abstraction – Each sensor (e.g., front‑camera, 64‑beam LiDAR) is stored as a series of (timestamp, payload) pairs. No global clock is imposed; developers can request synchronised samples (e.g., via nearest‑neighbour interpolation) or work asynchronously (e.g., process LiDAR at 10 Hz while cameras run at 30 Hz).
Dataset adapters – For every supported source (e.g., Waymo Open, nuScenes, Argoverse), a thin adapter parses the original files and populates the unified event streams. The adapters also translate original calibration matrices and ego‑poses into a common coordinate frame.
Annotation unification – A hierarchical class taxonomy (vehicle → car, truck, bus; pedestrian → person, cyclist, etc.) is defined. The adapters map source‑specific labels to this taxonomy and normalise attributes such as bounding‑box orientation, velocity, and traffic‑light state.
Quality assessment pipeline – The authors run pose‑error checks (e.g., loop‑closure consistency), calibration sanity tests (re‑projecting LiDAR points onto images), and statistical audits (class distribution, sensor coverage). Results are stored as metadata for downstream consumers.
API design – py123d exposes high‑level functions like load_scene(), get_sensor_data(sensor_name, timestamp), and query_annotations(filter). The API is deliberately framework‑agnostic, so data can be fed into PyTorch, TensorFlow, JAX, or RL libraries without extra glue code.
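
The event‑stream abstraction and nearest‑timestamp lookup described in this section can be sketched in plain Python. EventStream and its methods are hypothetical names chosen for illustration; they are not the py123d API, whose actual signatures should be checked against the project documentation.

```python
import bisect
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class EventStream:
    """One sensor's data as time-ordered (timestamp, payload) pairs."""
    name: str
    timestamps: List[float] = field(default_factory=list)
    payloads: List[Any] = field(default_factory=list)

    def append(self, timestamp: float, payload: Any) -> None:
        # Assumption: events arrive in non-decreasing time order.
        if self.timestamps and timestamp < self.timestamps[-1]:
            raise ValueError("events must be appended in time order")
        self.timestamps.append(timestamp)
        self.payloads.append(payload)

    def nearest(self, query: float, tolerance: float) -> Optional[Tuple[float, Any]]:
        """Return the event closest in time to `query`, or None if it is
        further away than `tolerance` seconds."""
        if not self.timestamps:
            return None
        i = bisect.bisect_left(self.timestamps, query)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.timestamps)]
        best = min(candidates, key=lambda j: abs(self.timestamps[j] - query))
        if abs(self.timestamps[best] - query) > tolerance:
            return None
        return self.timestamps[best], self.payloads[best]


# One second of LiDAR at 10 Hz and camera at 30 Hz, each with its own clock rate.
lidar = EventStream("lidar")
camera = EventStream("front_camera")
for k in range(10):
    lidar.append(k / 10, f"sweep_{k}")
for k in range(30):
    camera.append(k / 30, f"frame_{k}")

# For every LiDAR sweep, pick the nearest camera frame within 20 ms.
pairs = [
    (sweep, camera.nearest(t, tolerance=0.02))
    for t, sweep in zip(lidar.timestamps, lidar.payloads)
]
```

Because each stream keeps its own timestamps, synchronisation becomes a query‑time decision: the same data can be paired at the LiDAR rate, the camera rate, or not at all.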

Results & Findings

Study	Insight
Annotation statistics	Synthetic data exhibits a more balanced class distribution, while real datasets are heavily skewed toward cars and road‑side objects.
Pose & calibration accuracy	Waymo and nuScenes show sub‑centimeter ego‑pose drift, whereas older datasets (e.g., KITTI) have noticeable yaw errors that can degrade 3‑D detection if left uncorrected.
Cross‑dataset 3‑D detection	Pre‑training on the unified pool (all real + synthetic) and fine‑tuning on a target dataset improves mAP by 5–9 % over training on the target alone, especially for rare classes like bicycles and traffic signs.
RL planning transfer	Policies trained with scenarios drawn from multiple datasets learn more robust obstacle‑avoidance behaviours, reducing collision rates in simulation by ≈12 % compared to single‑dataset training.
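
The calibration checks behind the pose and calibration row rely on re‑projecting LiDAR points onto images. A crude version of that idea is sketched below with a pinhole camera model. The frame conventions, intrinsics, and the fraction_in_image metric are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def transform(T: Sequence[Sequence[float]], p: Vec3) -> Vec3:
    """Apply a row-major 4x4 rigid transform to a 3-D point."""
    x, y, z = p
    return (
        T[0][0] * x + T[0][1] * y + T[0][2] * z + T[0][3],
        T[1][0] * x + T[1][1] * y + T[1][2] * z + T[1][3],
        T[2][0] * x + T[2][1] * y + T[2][2] * z + T[2][3],
    )


def project(p_cam: Vec3, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection; returns None for points at or behind the camera plane."""
    x, y, z = p_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def fraction_in_image(points: List[Vec3], T_cam_from_lidar, fx, fy, cx, cy,
                      width: int, height: int) -> float:
    """Share of LiDAR points that project inside the image bounds."""
    if not points:
        return 0.0
    hits = 0
    for p in points:
        uv = project(transform(T_cam_from_lidar, p), fx, fy, cx, cy)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            hits += 1
    return hits / len(points)


# Assumed frames: LiDAR x-forward, y-left, z-up; camera x-right, y-down, z-forward.
T_CAM_FROM_LIDAR = [
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
points = [(10.0, 0.0, 0.0), (10.0, 1.0, 0.0), (-5.0, 0.0, 0.0), (1.0, 5.0, 0.0)]
ratio = fraction_in_image(points, T_CAM_FROM_LIDAR,
                          1000.0, 1000.0, 640.0, 360.0, 1280, 720)
```

A sudden drop in this ratio for a forward‑facing camera would hint at swapped axes or a wrong extrinsic, although a real check would also compare projected points against image edges or depth.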

Practical Implications

Faster prototyping – Teams no longer need to write bespoke parsers for each dataset; a few lines of code bring any supported data source into their training pipeline.
Data‑centric AI – By making it trivial to mix real and synthetic data, developers can augment scarce edge‑case scenarios (e.g., adverse weather, rare traffic‑light configurations) without manual data collection.
Benchmark standardisation – Researchers can now report a single “cross‑dataset” metric, encouraging models that generalise beyond a single benchmark and reducing the “benchmark‑overfitting” problem.
Calibration sanity checks – The built‑in quality tools help engineers catch sensor‑misalignment bugs early, saving costly re‑recording efforts.
Open‑source ecosystem – The GitHub repo includes Docker images, Jupyter notebooks, and integration hooks for popular ML frameworks, lowering the barrier for startups and academic labs to adopt a multi‑modal data strategy.
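
Mixing real and synthetic data, as described under data‑centric AI above, often comes down to weighted sampling across sources. The sketch below assumes scenes are identified by strings and that the mixing ratio is chosen by the user; mixed_sampler, the pool names, and the weights are hypothetical and not part of py123d.

```python
import random
from itertools import islice
from typing import Dict, Iterator, List, Tuple


def mixed_sampler(pools: Dict[str, List[str]], weights: Dict[str, float],
                  seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield (source, scene_id) pairs forever, choosing the source by weight
    and then a scene uniformly at random from that source."""
    rng = random.Random(seed)  # seeded for reproducible mixes
    names = sorted(pools)
    source_weights = [weights[name] for name in names]
    while True:
        source = rng.choices(names, weights=source_weights, k=1)[0]
        yield source, rng.choice(pools[source])


# Hypothetical scene pools and a 70/30 real-to-synthetic mix.
pools = {
    "real": [f"real_{i:03d}" for i in range(50)],
    "synthetic": [f"syn_{i:03d}" for i in range(20)],
}
weights = {"real": 0.7, "synthetic": 0.3}

samples = list(islice(mixed_sampler(pools, weights, seed=0), 1000))
synthetic_share = sum(1 for source, _ in samples if source == "synthetic") / len(samples)
print(f"synthetic share: {synthetic_share:.2f}")
```

Sampling the source first keeps the mix independent of pool size, so a small synthetic set of rare scenarios is not drowned out by a much larger real dataset.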

Limitations & Future Work

Coverage – While eight major real‑world datasets are supported, many emerging collections (e.g., regional fleets, proprietary OEM data) still require custom adapters.
Temporal alignment – The event‑stream model assumes timestamps are reliable; datasets with poorly synchronised clocks may need additional post‑processing.
Synthetic realism gap – Although the synthetic suite is configurable, bridging the domain gap to real sensor noise remains an open challenge.
Scalability – Storing all modalities as independent streams can increase storage overhead; future work will explore on‑the‑fly compression and cloud‑native streaming.
Extending to V2X – The authors plan to incorporate vehicle‑to‑everything communication data (e.g., DSRC, C‑V2X) to enable cooperative perception research.
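
Since the event‑stream model depends on trustworthy timestamps, a basic audit can flag streams that need post‑processing. The sketch below is one possible check and not the authors' pipeline: audit_timestamps and the gap_factor threshold are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def audit_timestamps(timestamps: List[float], gap_factor: float = 2.5) -> Dict[str, object]:
    """Flag repeated or backwards timestamps and unusually long gaps.

    Reported indices refer to the later event of each offending pair."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    positive = [d for d in deltas if d > 0]
    # Typical sampling period, estimated from forward-moving steps only.
    period = median(positive) if positive else None
    non_monotonic = [i + 1 for i, d in enumerate(deltas) if d <= 0]
    gaps = [
        i + 1
        for i, d in enumerate(deltas)
        if period is not None and d > gap_factor * period
    ]
    return {"period": period, "non_monotonic": non_monotonic, "gaps": gaps}


# A nominal 10 Hz stream with one repeated stamp and one dropout.
stamps = [0.0, 0.1, 0.2, 0.2, 0.3, 0.9, 1.0]
report = audit_timestamps(stamps)
print(report)
```

The median is used for the period so that a few dropouts do not distort the estimate of the nominal rate.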

By putting fragmented autonomous‑driving datasets behind a single interface, 123D aims to speed up research on models that generalise across data sources.

Authors

Daniel Dauner
Valentin Charraut
Bastian Berle
Tianyu Li
Long Nguyen
Jiabao Wang
Changhui Jing
Maximilian Igl
Holger Caesar
Boris Ivanovic
Yiyi Liao
Andreas Geiger
Kashyap Chitta

Paper Information

arXiv ID: 2605.08084v1
Categories: cs.RO, cs.CV
Published: May 8, 2026
