[Paper] From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
Source: arXiv - 2603.02194v1
Overview
The paper From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories examines a hidden problem in autonomous‑vehicle (AV) research: most perception models are judged only by benchmark scores, while their underlying code is rarely inspected for production‑grade quality. By analysing 178 open‑source perception models from the KITTI and NuScenes leaderboards, the authors reveal a stark mismatch between leaderboard success and real‑world deployability.
Key Contributions
- Large‑scale empirical audit of 178 AV perception repositories using static analysis tools (Pylint, Bandit, Radon).
- Production‑readiness metric: definition of a “ready‑to‑deploy” baseline (zero critical errors + no high‑severity security flaws). Only 7.3 % of repos meet it.
- Security vulnerability taxonomy: identification that five vulnerability types account for ~80 % of all high‑severity findings.
- Correlation study: CI/CD pipeline adoption is linked to higher maintainability scores.
- Actionable guidelines: a concise checklist for developers to avoid the most common pitfalls and improve code quality before shipping.
Methodology
- Dataset construction – The authors scraped the KITTI and NuScenes 3‑D object‑detection leaderboards, extracting the GitHub URLs of every unique model (178 in total).
- Static analysis pipeline – Each repository was cloned and run through three open‑source linters:
- Pylint for general Python errors and style violations.
- Bandit for security‑related patterns (e.g., unsafe deserialization, hard‑coded credentials).
- Radon for cyclomatic complexity, code duplication, and maintainability index.
- Metric aggregation – Errors were classified by severity (critical, high, medium, low). A repository was flagged as “production‑ready” only if it had no critical Pylint errors and no high‑severity Bandit findings.
- Statistical analysis – The team compared maintainability scores across projects with and without CI/CD configuration files (.github/workflows, GitLab CI, etc.) using Mann‑Whitney U tests.
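The statistical comparison in the last step can be sketched in pure Python. The maintainability scores below are invented toy numbers, not the paper's data; the U statistic is computed via the standard pair-counting definition, with the usual normal approximation for the z-score:

```python
import math

def mann_whitney_u(a, b):
    # U via the pair-counting definition: #(x > y) + 0.5 * #(ties).
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

def z_score(a, b):
    # Normal approximation to the null distribution of U (reasonable for n >= ~8).
    n1, n2 = len(a), len(b)
    u = mann_whitney_u(a, b)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mu) / sigma

# Toy maintainability-index samples (NOT the paper's measurements).
with_ci    = [71, 68, 74, 66, 72, 69, 75, 70]
without_ci = [60, 63, 58, 65, 61, 59, 64, 62]

print(f"U = {mann_whitney_u(with_ci, without_ci)}, "
      f"z = {z_score(with_ci, without_ci):.2f}")  # → U = 64.0, z = 3.36
```

In practice one would use `scipy.stats.mannwhitneyu`, which also returns an exact or corrected p-value; the hand-rolled version above just makes the mechanics of the test visible.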
The entire workflow is fully reproducible and relies on publicly available tools, making it easy for other teams to replicate or extend the study.
Results & Findings
| Aspect | Observation |
|---|---|
| Production readiness | Only 7.3 % (≈13 repos) satisfy the zero‑critical / zero‑high‑severity rule. |
| Error distribution | Median of 12 Pylint warnings per repo; 41 % contain at least one critical error. |
| Security hotspots | Top 5 Bandit issues (e.g., use of eval, insecure temporary files, hard‑coded secrets) represent ≈80 % of all high‑severity findings. |
| Maintainability | Average Radon maintainability index: 62 (on a 0‑100 scale). Projects with CI/CD pipelines score ≈8 points higher (p < 0.01). |
| CI/CD adoption | Only 22 % of repos declare a CI workflow, yet those that do show markedly fewer critical errors. |
| Correlation with leaderboard rank | No statistically significant link between benchmark mAP/IoU scores and code quality metrics. High‑performing models can be among the worst‑coded. |
These results suggest that “winning” a leaderboard does not guarantee that a model is safe or maintainable enough for integration into a production AV stack.
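The "no link between rank and quality" finding rests on rank correlation. A minimal pure-Python Spearman computation illustrates the idea; the mAP and maintainability values below are invented for eight hypothetical repos (for real analyses, `scipy.stats.spearmanr` additionally reports a p-value):

```python
def ranks(xs):
    # Simple 1-based ranks; the toy data below has no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r

def spearman_rho(x, y):
    # Pearson correlation applied to the ranks of x and y.
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sd = lambda r, m: sum((v - m) ** 2 for v in r) ** 0.5
    return cov / (sd(rx, mx) * sd(ry, my))

# Invented leaderboard mAP and Radon maintainability-index values.
map_scores = [78.2, 75.1, 73.9, 71.4, 69.8, 68.0, 66.5, 64.2]
maint_idx  = [59, 70, 55, 81, 62, 48, 74, 66]

print(f"rho = {spearman_rho(map_scores, maint_idx):.2f}")  # → rho = -0.14
```

A weak correlation like this, with a large p-value, is what the paper reports: a repo's position on the leaderboard says little about how well its code is written.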
Practical Implications
- For ML engineers: Before integrating a perception model, run the same static analysis suite; treat the “production‑ready” flag as a gating checkpoint.
- For DevOps teams: Adopt CI pipelines early (linting, security scans, complexity checks) – the data shows a tangible quality lift.
- For safety‑critical system designers: Incorporate code‑quality metrics into the safety case (e.g., ISO 26262, ISO 21448) alongside functional performance.
- For open‑source contributors: The paper’s checklist (see below) can be added to repository READMEs to signal readiness to potential adopters.
Quick‑start checklist derived from the authors’ guidelines
- Run Pylint & fix all critical warnings (undefined variables, import errors).
- Scan with Bandit and remediate every high‑severity issue (replace eval, avoid hard‑coded secrets).
- Add a CI workflow that runs the two scans on every PR.
- Keep cyclomatic complexity < 10 for most functions (Radon).
- Document dependencies and environment (requirements.txt, Dockerfile) to aid reproducibility.
Adopting these steps can push a typical research repo from “demo‑only” to a candidate for real‑world testing.
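The first three checklist items amount to a gate script. The sketch below assumes the JSON report shapes emitted by `pylint --output-format=json` and `bandit -f json` (fields such as type, path, issue_severity, and line_number follow those tools' documented output); treating Pylint "error"/"fatal" messages as critical is this sketch's own mapping, not the paper's exact rule:

```python
import json
import subprocess

def summarize(pylint_msgs, bandit_report):
    """Return a CI exit code: 1 if any critical Pylint error or
    high-severity Bandit finding is present, else 0."""
    criticals = [m for m in pylint_msgs if m.get("type") in ("error", "fatal")]
    highs = [r for r in bandit_report.get("results", [])
             if r.get("issue_severity") == "HIGH"]
    for m in criticals:
        print(f"pylint critical: {m.get('path')}:{m.get('line')} {m.get('message')}")
    for r in highs:
        print(f"bandit high: {r.get('filename')}:{r.get('line_number')} {r.get('issue_text')}")
    return 1 if (criticals or highs) else 0

def run_json(cmd, fallback):
    # Both linters exit non-zero whenever they report issues,
    # so parse stdout instead of checking the return code.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(proc.stdout) if proc.stdout.strip() else fallback

def gate(target):
    pylint_msgs = run_json(["pylint", "--output-format=json", target], [])
    bandit_out = run_json(["bandit", "-r", target, "-f", "json", "-q"], {})
    return summarize(pylint_msgs, bandit_out)

# Demo on toy report fragments shaped like the tools' JSON output:
demo_pylint = [{"type": "error", "path": "model.py", "line": 42,
                "message": "Undefined variable 'cfg'"}]
demo_bandit = {"results": [{"issue_severity": "HIGH", "filename": "utils.py",
                            "line_number": 7, "issue_text": "Use of eval detected."}]}
print(summarize(demo_pylint, demo_bandit))  # → 1 (build should fail)
print(summarize([], {"results": []}))       # → 0 (gate passes)
```

Wiring `gate("src")` into a CI job that fails on a non-zero return gives exactly the paper's "production-ready" check on every pull request.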
Limitations & Future Work
- Language bias – The study only covered Python repositories; many perception pipelines also involve C++/CUDA components that were not examined.
- Static analysis scope – Dynamic security issues (e.g., runtime injection attacks) remain invisible to the tools used.
- Benchmark selection – Only KITTI and NuScenes were considered; other datasets (Waymo Open, Argoverse) might exhibit different patterns.
- Causality vs. correlation – While CI/CD adoption correlates with better maintainability, the study cannot prove it causes the improvement.
Future research directions include extending the audit to multi‑language stacks, integrating dynamic testing (fuzzing, runtime profiling), and evaluating how code‑quality improvements affect downstream safety certification processes.
Authors
- Mateus Karvat
- Bram Adams
- Sidney Givigi
Paper Information
- arXiv ID: 2603.02194v1
- Categories: cs.CV, cs.LG, cs.RO, cs.SE
- Published: March 2, 2026