[Paper] From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
Source: arXiv - 2603.02194v1
Overview
The paper From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories examines a hidden problem in autonomous‑vehicle (AV) research: most perception models are judged only by benchmark scores, while their underlying code is rarely inspected for production‑grade quality. By analysing 178 open‑source perception models from the KITTI and NuScenes leaderboards, the authors reveal a stark mismatch between leaderboard success and real‑world deployability.
Key Contributions
- Large‑scale empirical audit of 178 AV perception repositories using static analysis tools (Pylint, Bandit, Radon).
- Production‑readiness metric: definition of a “ready‑to‑deploy” baseline (zero critical errors + no high‑severity security flaws). Only 7.3 % of repos meet it.
- Security vulnerability taxonomy: identification that five vulnerability types account for ~80 % of all high‑severity findings.
- Correlation study: CI/CD pipeline adoption is linked to higher maintainability scores.
- Actionable guidelines: a concise checklist for developers to avoid the most common pitfalls and improve code quality before shipping.
Methodology
- Dataset construction – The authors scraped the KITTI and NuScenes 3‑D object‑detection leaderboards, extracting the GitHub URLs of every unique model (178 in total).
- Static analysis pipeline – Each repository was cloned and run through three open‑source linters:
- Pylint for general Python errors and style violations.
- Bandit for security‑related patterns (e.g., unsafe deserialization, hard‑coded credentials).
- Radon for cyclomatic complexity, code duplication, and maintainability index.
- Metric aggregation – Errors were classified by severity (critical, high, medium, low). A repository was flagged as “production‑ready” only if it had no critical Pylint errors and no high‑severity Bandit findings.
- Statistical analysis – The team compared maintainability scores across projects with and without CI/CD configuration files (.github/workflows, GitLab CI, etc.) using Mann‑Whitney U tests.
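The statistical comparison in the last step can be sketched in pure Python. The maintainability scores below are invented toy numbers, not the paper's data; the U statistic is computed via the standard pair-counting definition, with the usual normal approximation for the z-score:

```python
import math

def mann_whitney_u(a, b):
    # U via the pair-counting definition: #(x > y) + 0.5 * #(ties).
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

def z_score(a, b):
    # Normal approximation to the null distribution of U (reasonable for n >= ~8).
    n1, n2 = len(a), len(b)
    u = mann_whitney_u(a, b)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mu) / sigma

# Toy maintainability-index samples (NOT the paper's measurements).
with_ci    = [71, 68, 74, 66, 72, 69, 75, 70]
without_ci = [60, 63, 58, 65, 61, 59, 64, 62]

print(f"U = {mann_whitney_u(with_ci, without_ci)}, "
      f"z = {z_score(with_ci, without_ci):.2f}")  # → U = 64.0, z = 3.36
```

In practice one would use `scipy.stats.mannwhitneyu`, which also returns an exact or corrected p-value; the hand-rolled version above just makes the mechanics of the test visible.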
The entire workflow is fully reproducible and relies on publicly available tools, making it easy for other teams to replicate or extend the study.
Results & Findings
| Aspect | Observation |
|---|---|
| Production readiness | Only 7.3 % (≈13 repos) satisfy the zero‑critical / zero‑high‑severity rule. |
| Error distribution | Median of 12 Pylint warnings per repo; 41 % contain at least one critical error. |
| Security hotspots | Top 5 Bandit issues (e.g., use of eval, insecure temporary files, hard‑coded secrets) represent ≈80 % of all high‑severity findings. |
| Maintainability | Average Radon maintainability index: 62 (on a 0‑100 scale). Projects with CI/CD pipelines score ≈8 points higher (p < 0.01). |
| CI/CD adoption | Only 22 % of repos declare a CI workflow, yet those that do show markedly fewer critical errors. |
| Correlation with leaderboard rank | No statistically significant link between benchmark mAP/IoU scores and code quality metrics. High‑performing models can be among the worst‑coded. |
These results suggest that “winning” a leaderboard does not guarantee that a model is safe or maintainable enough for integration into a production AV stack.
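The "no link between rank and quality" finding rests on rank correlation. A minimal pure-Python Spearman computation illustrates the idea; the mAP and maintainability values below are invented for eight hypothetical repos (for real analyses, `scipy.stats.spearmanr` additionally reports a p-value):

```python
def ranks(xs):
    # Simple 1-based ranks; the toy data below has no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r

def spearman_rho(x, y):
    # Pearson correlation applied to the ranks of x and y.
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sd = lambda r, m: sum((v - m) ** 2 for v in r) ** 0.5
    return cov / (sd(rx, mx) * sd(ry, my))

# Invented leaderboard mAP and Radon maintainability-index values.
map_scores = [78.2, 75.1, 73.9, 71.4, 69.8, 68.0, 66.5, 64.2]
maint_idx  = [59, 70, 55, 81, 62, 48, 74, 66]

print(f"rho = {spearman_rho(map_scores, maint_idx):.2f}")  # → rho = -0.14
```

A weak correlation like this, with a large p-value, is what the paper reports: a repo's position on the leaderboard says little about how well its code is written.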
Practical Implications
- For ML engineers: Before integrating a perception model, run the same static analysis suite; treat the “production‑ready” flag as a gating checkpoint.
- For DevOps teams: Adopt CI pipelines early (linting, security scans, complexity checks) – the data shows a tangible quality lift.
- For safety‑critical system designers: Incorporate code‑quality metrics into the safety case (e.g., ISO 26262, ISO 21448) alongside functional performance.
- For open‑source contributors: The paper’s checklist (see below) can be added to repository READMEs to signal readiness to potential adopters.
Quick‑start checklist derived from the authors’ guidelines
- Run Pylint & fix all critical warnings (undefined variables, import errors).
- Scan with Bandit and remediate every high‑severity issue (replace eval, avoid hard‑coded secrets).
- Add a CI workflow that runs the two scans on every PR.
- Keep cyclomatic complexity < 10 for most functions (Radon).
- Document dependencies and environment (requirements.txt, Dockerfile) to aid reproducibility.
Adopting these steps can push a typical research repo from “demo‑only” to a candidate for real‑world testing.
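The first three checklist items amount to a gate script. The sketch below assumes the JSON report shapes emitted by `pylint --output-format=json` and `bandit -f json` (fields such as type, path, issue_severity, and line_number follow those tools' documented output); treating Pylint "error"/"fatal" messages as critical is this sketch's own mapping, not the paper's exact rule:

```python
import json
import subprocess

def summarize(pylint_msgs, bandit_report):
    """Return a CI exit code: 1 if any critical Pylint error or
    high-severity Bandit finding is present, else 0."""
    criticals = [m for m in pylint_msgs if m.get("type") in ("error", "fatal")]
    highs = [r for r in bandit_report.get("results", [])
             if r.get("issue_severity") == "HIGH"]
    for m in criticals:
        print(f"pylint critical: {m.get('path')}:{m.get('line')} {m.get('message')}")
    for r in highs:
        print(f"bandit high: {r.get('filename')}:{r.get('line_number')} {r.get('issue_text')}")
    return 1 if (criticals or highs) else 0

def run_json(cmd, fallback):
    # Both linters exit non-zero whenever they report issues,
    # so parse stdout instead of checking the return code.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(proc.stdout) if proc.stdout.strip() else fallback

def gate(target):
    pylint_msgs = run_json(["pylint", "--output-format=json", target], [])
    bandit_out = run_json(["bandit", "-r", target, "-f", "json", "-q"], {})
    return summarize(pylint_msgs, bandit_out)

# Demo on toy report fragments shaped like the tools' JSON output:
demo_pylint = [{"type": "error", "path": "model.py", "line": 42,
                "message": "Undefined variable 'cfg'"}]
demo_bandit = {"results": [{"issue_severity": "HIGH", "filename": "utils.py",
                            "line_number": 7, "issue_text": "Use of eval detected."}]}
print(summarize(demo_pylint, demo_bandit))  # → 1 (build should fail)
print(summarize([], {"results": []}))       # → 0 (gate passes)
```

Wiring `gate("src")` into a CI job that fails on a non-zero return gives exactly the paper's "production-ready" check on every pull request.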
Limitations & Future Work
- Language bias – The study only covered Python repositories; many perception pipelines also involve C++/CUDA components that were not examined.
- Static analysis scope – Dynamic security issues (e.g., runtime injection attacks) remain invisible to the tools used.
- Benchmark selection – Only KITTI and NuScenes were considered; other datasets (Waymo Open, Argoverse) might exhibit different patterns.
- Causality vs. correlation – While CI/CD adoption correlates with better maintainability, the study cannot prove it causes the improvement.
Future research directions include extending the audit to multi‑language stacks, integrating dynamic testing (fuzzing, runtime profiling), and evaluating how code‑quality improvements affect downstream safety certification processes.
Authors
- Mateus Karvat
- Bram Adams
- Sidney Givigi
Paper Information
- arXiv ID: 2603.02194v1
- Categories: cs.CV, cs.LG, cs.RO, cs.SE
- Published: March 2, 2026