[Paper] Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

Published: November 28, 2025 at 11:09 AM EST
3 min read
Source: arXiv - 2511.23311v1

Overview

A new study explores how large‑scale vision‑language models (LVLMs) can be turned into “co‑pilots” that watch both the road ahead and the driver’s cabin, then automatically generate safety‑focused driving instructions. By building a dedicated dataset and fine‑tuning existing LVLMs, the authors show that these models can move beyond generic image captioning toward real‑time, safety‑aware assistance for drivers.

Key Contributions

  • Dual‑view dataset: Collected and annotated synchronized road‑facing and driver‑facing video clips with safety‑relevant events (e.g., mobile phone use, drowsiness, lane violations); an illustrative annotation record appears after this list.
  • LVLM adaptation pipeline: Demonstrated a practical fine‑tuning workflow that injects safety‑oriented language grounding into pre‑trained vision‑language models.
  • Benchmark & evaluation: Defined quantitative metrics (instruction accuracy, hazard detection recall) and qualitative analyses to assess LVLM performance on the dual‑view task.
  • Error taxonomy: Identified common failure modes (subtle gestures, occlusions, multi‑modal reasoning gaps) to guide future model improvements.
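
To make the dual‑view annotations concrete, here is a minimal sketch of what one labeled clip record might look like. The field names (clip_id, road_view, cabin_view, hazard, instruction) and values are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical annotation record for one synchronized dual-view clip.
# All field names and values are illustrative assumptions, not the paper's schema.
sample = {
    "clip_id": "session_0421_clip_017",
    "road_view": "road/session_0421_clip_017.mp4",    # windshield-mounted camera
    "cabin_view": "cabin/session_0421_clip_017.mp4",  # driver-facing camera
    "hazard": "mobile_phone_use",                      # underlying hazard category
    "instruction": "Please put the phone away",        # target safety instruction
    "duration_s": 4.0,
}
```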

Methodology

  1. Data collection – The team recorded thousands of short driving sessions using two synchronized cameras: one mounted on the windshield (road view) and one facing the driver (cabin view). Each clip was labeled with a concise safety instruction (e.g., “Please put the phone away”) and the underlying hazard.
  2. Model backbone – They started from publicly available LVLMs that combine a vision encoder (e.g., CLIP‑ViT) with a large language model (e.g., LLaMA).
  3. Fine‑tuning strategy
    • Multi‑modal fusion: Concatenated embeddings from the two video streams before feeding them to the language decoder.
    • Instruction‑tuning: Trained the model on a mixture of “question → answer” and “image → instruction” pairs, emphasizing safety‑related prompts.
    • Temporal handling: Applied a lightweight transformer over frame‑level features to capture short‑term dynamics (e.g., a hand reaching for a phone); a minimal code sketch of this fusion‑and‑temporal setup appears after this list.
  4. Evaluation – Measured how often the generated instruction matched the ground‑truth label (exact match), and computed recall for each hazard category. Human judges also rated the usefulness of the instructions.
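
As a rough illustration of step 3, the PyTorch‑style sketch below fuses road‑view and cabin‑view frame embeddings by concatenation, runs a lightweight temporal transformer over them, and projects the result into a language model's embedding space. Module names, dimensions, and the specific layers are assumptions for illustration, not the authors' implementation or any particular LVLM's API.

```python
import torch
import torch.nn as nn

class DualViewFusion(nn.Module):
    """Minimal sketch: fuse road-view and cabin-view frame embeddings,
    apply a lightweight temporal transformer, and project the tokens
    into the language decoder's embedding space. Sizes are illustrative."""

    def __init__(self, vis_dim=768, lm_dim=4096):
        super().__init__()
        # Lightweight transformer over per-frame features (short-term dynamics).
        layer = nn.TransformerEncoderLayer(
            d_model=vis_dim, nhead=8, dim_feedforward=2048, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # Project fused visual tokens into the LLM's embedding space.
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, road_feats, cabin_feats):
        # road_feats, cabin_feats: (batch, n_frames, vis_dim) from a frozen
        # vision encoder (e.g., CLIP-ViT), one embedding per sampled frame.
        fused = torch.cat([road_feats, cabin_feats], dim=1)  # concatenate streams
        fused = self.temporal(fused)                         # short-term dynamics
        return self.proj(fused)  # (batch, 2 * n_frames, lm_dim) visual tokens

# Usage sketch: during instruction tuning on "image -> instruction" pairs, the
# projected visual tokens would be prepended to the prompt's token embeddings
# before being fed to the language decoder.
road = torch.randn(2, 8, 768)
cabin = torch.randn(2, 8, 768)
visual_tokens = DualViewFusion()(road, cabin)
print(visual_tokens.shape)  # torch.Size([2, 16, 4096])
```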

Results & Findings

| Model | Exact‑match Instruction Accuracy | Hazard Recall (avg.) |
| --- | --- | --- |
| Pre‑trained LVLM (no fine‑tuning) | 38 % | 32 % |
| Fine‑tuned LVLM (dual‑view) | 71 % | 68 % |
| Human baseline* | 94 % | 92 % |

  • Fine‑tuning nearly doubled instruction accuracy (38 % → 71 %) and more than doubled average hazard recall (32 % → 68 %) relative to the raw pre‑trained model.
  • The largest gains came on salient hazards (e.g., “phone on lap”), while subtle cues (e.g., micro‑yawning) still lag behind.
  • Human evaluators rated the model’s instructions as “helpful” in 63 % of cases, compared to 85 % for the human baseline.
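
To make the reported metrics concrete, the snippet below shows one straightforward way to compute exact‑match instruction accuracy and per‑hazard recall. It mirrors the metric definitions above rather than the authors' evaluation code; the text normalization step and the toy labels are assumptions.

```python
from collections import defaultdict

def exact_match_accuracy(predictions, references):
    """Fraction of generated instructions that exactly match the reference
    after simple normalization (lowercasing, whitespace, trailing period)."""
    norm = lambda s: s.lower().strip().rstrip(".")
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

def per_hazard_recall(pred_hazards, true_hazards):
    """Recall per hazard category: of the clips truly showing a hazard,
    how many did the model flag with that hazard?"""
    total, correct = defaultdict(int), defaultdict(int)
    for pred, true in zip(pred_hazards, true_hazards):
        total[true] += 1
        if pred == true:
            correct[true] += 1
    return {h: correct[h] / total[h] for h in total}

# Toy example with hypothetical labels:
preds = ["phone_use", "drowsiness", "none", "phone_use"]
truth = ["phone_use", "drowsiness", "drowsiness", "phone_use"]
print(per_hazard_recall(preds, truth))
# {'phone_use': 1.0, 'drowsiness': 0.5}
```

The “Hazard Recall (avg.)” column in the table would then be the mean of these per‑category values.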

Practical Implications

  • In‑vehicle safety assistants: Automakers can embed a dual‑camera LVLM module to provide real‑time verbal prompts, reducing distracted‑driving incidents without requiring expensive LiDAR or radar setups.
  • Fleet monitoring: Logistics companies could deploy the system on dashcams to flag risky driver behavior for post‑trip review, improving compliance and insurance outcomes.
  • Regulatory compliance tools: The model’s ability to generate explicit safety instructions aligns with emerging mandates for driver‑monitoring systems in many jurisdictions.
  • Extensible platform: Because the approach builds on generic LVLMs, it can be adapted to other domains (e.g., construction site safety, cockpit monitoring) with modest data collection.

Limitations & Future Work

  • Subtle event detection – The model still struggles with low‑visibility cues such as brief glances at a phone or early signs of fatigue.
  • Temporal scope – The current architecture only looks at a few seconds of video; longer‑range reasoning (e.g., predicting lane drift) remains unexplored.
  • Dataset bias – The collected clips are limited to specific vehicle models and lighting conditions, which may affect generalization to diverse real‑world fleets.
  • Explainability – The system outputs instructions but does not provide visual evidence (e.g., bounding boxes) for why a hazard was flagged, which could hinder driver trust.

Future research directions include integrating attention‑based visual grounding, expanding the dataset to cover night‑time and adverse weather, and coupling LVLMs with sensor fusion (e.g., CAN‑bus data) for richer context.

*Human baseline derived from expert annotators who watched the same clips and wrote the optimal instruction.

Authors

  • Haruki Sakajo
  • Hiroshi Takato
  • Hiroshi Tsutsui
  • Komei Soda
  • Hidetaka Kamigaito
  • Taro Watanabe

Paper Information

  • arXiv ID: 2511.23311v1
  • Categories: cs.CV, cs.AI, cs.CL
  • Published: November 28, 2025
  • PDF: https://arxiv.org/pdf/2511.23311v1