[Paper] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs
Source: arXiv - 2511.20965v1
Overview
The paper TrafficLens tackles a real‑world bottleneck: turning dozens of traffic‑camera feeds into actionable textual insights fast enough for city operators and law‑enforcement teams. By chaining Vision‑Language Models (VLMs) with a lightweight similarity filter, the authors cut video‑to‑text conversion time by up to 4× while keeping description accuracy high enough for practical use.
Key Contributions
- Sequential multi‑camera pipeline that re‑uses the output of one camera as the prompt for the next, exploiting overlapping fields of view.
- Dynamic token budgeting: each VLM call is sized to the camera’s remaining “information budget,” preventing unnecessarily long prompts.
- Object‑level similarity detector that skips VLM processing for frames that add no new visual content, dramatically reducing redundant work.
- Real‑world evaluation on multi‑camera intersection datasets showing a 4× speed‑up with negligible loss in descriptive fidelity.
Methodology
- Pre‑processing & Overlap Mapping – The system first builds a map of which cameras share visual coverage (e.g., two adjacent cameras both cover the same lane).
- Iterative VLM Invocation –
  - Camera 1’s video segment is fed to a Vision‑Language Model, producing a concise textual description.
  - This description becomes part of the prompt for Camera 2, which now only needs to describe what’s new beyond what Camera 1 already captured.
  - The process repeats for all cameras in the overlap chain.
- Token‑limit Adaptation – Each VLM call respects a per‑camera token ceiling; if the previous description already consumes most of the budget, the next call is trimmed accordingly.
- Object‑Level Similarity Check – Before invoking the VLM, a lightweight detector (e.g., a fast CNN + cosine similarity on object embeddings) compares the current frame’s detected objects with those already reported. If similarity exceeds a threshold, the VLM step is skipped and the previous text is reused (a minimal sketch of this gating logic follows the list).
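The following sketch illustrates how such an object‑level gate could look. It is not the authors’ implementation: the embedding source, the helper names, and the 0.9 threshold are placeholder assumptions; only the idea of comparing object embeddings with cosine similarity and skipping the VLM when nothing new appears comes from the paper.

```python
# Illustrative sketch of the object-level similarity gate (not the paper's code).
# Object embeddings are assumed to come from any off-the-shelf detector/encoder;
# the 0.9 threshold is an arbitrary placeholder.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def is_redundant(frame_objects: list[np.ndarray],
                 reported_objects: list[np.ndarray],
                 threshold: float = 0.9) -> bool:
    """Return True when every object in the current frame closely matches one
    already described, so the VLM call can be skipped and prior text reused."""
    if not frame_objects:
        return True                  # nothing detected -> nothing new to describe
    if not reported_objects:
        return False                 # first frame: always run the VLM
    for obj in frame_objects:
        best = max(cosine(obj, rep) for rep in reported_objects)
        if best < threshold:
            return False             # at least one genuinely new object
    return True
```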
The overall flow is a retrieval‑augmented generation loop where visual data is “retrieved” (via similarity detection) and then “generated” (via the VLM) in a cascade that mirrors how a human analyst would skim overlapping camera feeds, as sketched below.
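A condensed view of that cascade is shown below, again as a hedged sketch: `load_segment` and `call_vlm` are hypothetical stand‑ins for whatever video loader and VLM endpoint a deployment uses, and the word‑count “token” estimate is a deliberate simplification of the paper’s budgeting.

```python
# Illustrative sketch of the sequential, overlap-aware cascade (not the paper's code).
# `load_segment` / `call_vlm` are hypothetical stand-ins for a video loader and a
# VLM endpoint; tokens are approximated by word counts for brevity.
from typing import Callable

def summarize_intersection(cameras: list[str],
                           load_segment: Callable[[str], bytes],
                           call_vlm: Callable[[bytes, str, int], str],
                           token_budget: int = 512) -> dict[str, str]:
    summaries: dict[str, str] = {}
    context = ""                                 # rolling description from earlier cameras
    for cam in cameras:                          # cameras ordered along the overlap chain
        remaining = max(token_budget - len(context.split()), 64)
        prompt = ("Overlapping cameras already reported: " + (context or "nothing") +
                  "\nDescribe only what is new in this view.")
        text = call_vlm(load_segment(cam), prompt, remaining)  # `remaining` caps output tokens
        summaries[cam] = text
        context = " ".join([context, text]).strip()
    return summaries
```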
Results & Findings
| Metric | Baseline (independent VLM per camera) | TrafficLens |
|---|---|---|
| Avg. conversion time per intersection (seconds) | 12.8 | 3.2 (≈ 4× faster) |
| Textual fidelity (BLEU‑4) | 0.71 | 0.68 (Δ ≈ 4 %) |
| Redundant VLM calls eliminated | 0 % | 62 % |
| End‑to‑end latency for a 30‑second incident clip | 15 s | 4.5 s |
The authors report that the slight dip in BLEU‑4 is mostly due to omitted repetitive details (e.g., “a sedan continues straight”) that the similarity filter pruned—information that is rarely needed for incident reporting.
Practical Implications
- Faster incident response – Operators can query a multi‑camera intersection and receive a coherent textual summary in under 5 seconds, enabling near‑real‑time decision making.
- Cost‑effective scaling – By cutting the number of expensive VLM calls, city IT budgets can support more cameras without a proportional increase in cloud‑compute spend.
- Improved searchable archives – The generated text can be indexed for keyword search, making post‑event investigations (e.g., “find all red trucks at 5 pm”) much quicker (see the indexing sketch at the end of this section).
- Plug‑and‑play for existing ITS stacks – TrafficLens is a pipeline wrapper; it can sit on top of any off‑the‑shelf VLM (e.g., GPT‑4V, LLaVA) and any object detector, requiring only configuration of overlap maps.
Developers building smart‑city dashboards, autonomous‑vehicle simulation platforms, or law‑enforcement video‑review tools can adopt TrafficLens to turn raw video streams into structured, searchable narratives without redesigning their entire vision stack.
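As a concrete but hypothetical illustration of the archival use case, the generated descriptions could be dropped into a full‑text index. The snippet below uses SQLite’s FTS5 extension; the database name, table schema, and sample row are invented for this example, and FTS5 availability in the local SQLite build is assumed.

```python
# Illustrative sketch: indexing generated descriptions for keyword search.
# Table/column names and the sample row are made up; requires SQLite with FTS5.
import sqlite3

conn = sqlite3.connect("traffic_summaries.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS summaries "
             "USING fts5(camera_id, timestamp, description)")
conn.execute("INSERT INTO summaries VALUES (?, ?, ?)",
             ("cam_07", "2025-11-26T17:00:00", "a red truck turns left onto Main St"))
conn.commit()

# Post-event investigation, e.g. "find all red trucks":
for cam, ts, desc in conn.execute(
        "SELECT camera_id, timestamp, description FROM summaries "
        "WHERE summaries MATCH 'red truck'"):
    print(cam, ts, desc)
```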
Limitations & Future Work
- Dependence on accurate overlap maps – Mis‑aligned camera geometry can cause missed information or duplicate descriptions.
- Similarity detector thresholds are heuristic – Over‑aggressive pruning may drop subtle but important events (e.g., a pedestrian stepping off the curb).
- Evaluation limited to a single city’s dataset – Broader testing across varied lighting, weather, and camera qualities is needed.
- Future directions include learning the overlap graph automatically, integrating temporal reasoning (e.g., tracking a vehicle across cameras), and extending the approach to multimodal queries (audio + video).
Authors
- Md Adnan Arefeen
- Biplob Debnath
- Srimat Chakradhar
Paper Information
- arXiv ID: 2511.20965v1
- Categories: cs.CV, cs.CL
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.20965v1