[Paper] A Vision-and-Knowledge Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference

Published: January 2, 2026 at 09:13 AM EST
4 min read
Source: arXiv - 2601.00694v1

Overview

The paper presents Pedestrian Crossing LLM (PedX‑LLM), a novel framework that blends visual perception with domain‑specific knowledge to let a large language model reason about whether a pedestrian will cross the street. By moving from pure pattern‑matching to semantic, context‑aware inference, the authors achieve far better generalization to new, unseen locations—an essential step for real‑world traffic safety systems.

Key Contributions

  • Vision‑and‑knowledge integration: Combines visual embeddings extracted by LLaVA with textual transportation knowledge to enrich a LLaMA‑2‑7B model.
  • Low‑Rank Adaptation (LoRA) fine‑tuning: Efficiently adapts the large language model without full retraining, keeping computational costs modest (a minimal configuration sketch follows this list).
  • Strong empirical gains: Reaches 82.0 % balanced accuracy on the full dataset; per the ablations, vision contributes a 4.1 pp lift over the knowledge‑only variant and domain knowledge a 2.9 pp lift over the vision‑only variant.
  • Cross‑site generalization: Zero‑shot performance of 66.9 % on five completely unseen sites (≥ 18 pp improvement over traditional baselines).
  • Few‑shot adaptability: Adding just five labeled examples per site lifts the 66.9 % zero‑shot cross‑site accuracy to 72.2 %, demonstrating rapid on‑the‑fly customization.
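LoRA attaches small trainable low‑rank matrices to the frozen transformer weights, so only a fraction of a percent of the 7B parameters update during fine‑tuning. Below is a minimal sketch of such a setup using Hugging Face's `transformers` and `peft` libraries; the rank, scaling factor, and target modules are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal LoRA setup sketch with Hugging Face transformers + peft.
# The rank, alpha, and target modules below are illustrative guesses,
# not the configuration reported in the paper.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the weights train
```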

Methodology

  1. Data collection: Pedestrian videos and sensor logs from multiple urban sites, annotated with “cross” / “don’t cross” decisions.
  2. Visual feature extraction: Frames are fed to LLaVA (a vision‑language model) to produce dense embeddings that capture street layout, traffic signals, and surrounding objects.
  3. Knowledge injection: A curated set of transportation‑domain facts (e.g., right‑of‑way rules, typical crossing distances) is encoded as natural‑language prompts and concatenated with the visual embeddings (see the prompt‑assembly sketch after this list).
  4. Model fine‑tuning: The combined token stream is used to fine‑tune LLaMA‑2‑7B via LoRA, which adds a small set of trainable matrices to each transformer layer, preserving the original knowledge while specializing it for crossing inference.
  5. Evaluation protocol:
    • Standard split: Random train/validation/test to measure overall accuracy.
    • Cross‑site split: Entire sites are held out for testing, mimicking deployment in a new city (see the split sketch after this list).
    • Zero‑shot vs. few‑shot: The model is first evaluated without any site‑specific examples (zero‑shot) and then with a handful of labeled examples (few‑shot).
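Because the domain knowledge enters as natural language, the injection step is essentially string templating around the visual input. A minimal sketch follows, assuming a hypothetical template, helper name, and example facts; note that in the actual pipeline the LLaVA output enters as embedding tokens, whereas this sketch uses a textual scene description as a stand‑in.

```python
# Sketch of prompt assembly for crossing inference. The template, facts,
# and helper names are hypothetical; in the real pipeline the visual
# input is a sequence of LLaVA embeddings, not a text description.
DOMAIN_KNOWLEDGE = [
    "Pedestrians at marked crosswalks generally have the right of way.",
    "Crossing a two-lane road typically takes 7-10 seconds at walking pace.",
]

def build_prompt(scene_description: str, few_shot_examples: tuple[str, ...] = ()) -> str:
    """Concatenate domain facts, optional site-specific examples, and the scene."""
    parts = ["Transportation knowledge:"]
    parts += [f"- {fact}" for fact in DOMAIN_KNOWLEDGE]
    for example in few_shot_examples:  # few-shot site adaptation (5 examples in the paper)
        parts.append(f"Example: {example}")
    parts.append(f"Scene: {scene_description}")
    parts.append("Question: Will the pedestrian cross? Answer 'cross' or 'don't cross'.")
    return "\n".join(parts)

print(build_prompt("Pedestrian waiting at an unsignalized midblock crosswalk; SUV 30 m away."))
```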
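The cross‑site split can be reproduced with a grouped split so that no site appears in both partitions. Here is a sketch with scikit‑learn; the file name and column names ("site_id", the labels) are assumptions, not the paper's data schema.

```python
# Sketch of a cross-site evaluation split: entire sites are held out,
# so test clips come from locations the model never saw in training.
# The CSV file and column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("crossing_events.csv")  # hypothetical annotated dataset

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["site_id"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
assert set(train["site_id"]).isdisjoint(set(test["site_id"]))  # no site overlap
```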

Results & Findings

Setting                                   Balanced Accuracy
Full dataset (random split)               82.0 %
Vision‑only (no knowledge)                79.1 %
Knowledge‑only (no vision)                77.9 %
Zero‑shot cross‑site (5 unseen sites)     66.9 %
Few‑shot (5 examples per site)            72.2 %
  • The vision module contributes a 4.1 pp lift over the knowledge‑only variant by encoding the built environment (crosswalk markings, vehicle proximity, etc.).
  • Domain knowledge adds a 2.9 pp lift over the vision‑only variant, showing that explicit traffic rules complement raw visual cues.
  • Compared to the best statistical or supervised baselines, PedX‑LLM improves accuracy by ≥ 18 pp on unseen sites, confirming its superior generalizability (a toy computation of the metric follows this list).
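Balanced accuracy, the metric reported throughout, is the unweighted mean of per‑class recall, which keeps rare "cross" events from being drowned out by the majority class. A toy computation with scikit‑learn (the labels are made up for illustration):

```python
# Balanced accuracy averages recall over the "cross" and "no-cross"
# classes, so it is robust to class imbalance.
from sklearn.metrics import balanced_accuracy_score

y_true = ["cross", "cross", "no-cross", "no-cross", "no-cross"]  # toy labels
y_pred = ["cross", "no-cross", "no-cross", "no-cross", "cross"]

# mean of per-class recalls: (1/2 + 2/3) / 2 ≈ 0.583
print(balanced_accuracy_score(y_true, y_pred))
```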

Practical Implications

  • Smart traffic infrastructure: City‑wide pedestrian detection systems can deploy a single PedX‑LLM instance and expect reliable crossing predictions even in newly built districts, reducing the need for site‑specific data collection.
  • Advanced driver‑assistance (ADAS) & autonomous vehicles: Integrating PedX‑LLM enables more human‑like reasoning about pedestrian intent, improving safety margins in complex urban scenarios.
  • Rapid deployment: The few‑shot capability means that a municipality can adapt the model with only a handful of locally labeled clips, cutting onboarding time from weeks to hours.
  • Scalable safety analytics: Researchers and safety auditors can run batch inference on city‑wide video feeds to identify high‑risk crossing locations without retraining separate models per site (see the sketch below).
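As a sketch of what such an audit loop could look like: the `clips_by_site` mapping and the `predict_crossing` callable below are hypothetical stand‑ins for the deployed model and data pipeline, not an API from the paper.

```python
# Hypothetical batch-inference loop over per-site video clips. The point
# is that one model instance serves every site without retraining.
from collections import Counter

def audit_sites(clips_by_site: dict[str, list[str]], predict_crossing) -> dict[str, float]:
    """Return the fraction of predicted crossings per site as a simple risk proxy."""
    risk = {}
    for site, clips in clips_by_site.items():
        preds = Counter(predict_crossing(clip) for clip in clips)
        risk[site] = preds["cross"] / max(1, len(clips))
    return risk
```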

Limitations & Future Work

  • Data diversity: The study relies on a limited number of urban environments; performance in rural or highly congested megacity settings remains untested.
  • Real‑time constraints: While LoRA reduces training cost, inference latency with vision‑language pipelines may still be too high for ultra‑low‑latency ADAS loops; model compression or edge‑optimized variants are needed.
  • Knowledge base scope: The current rule set covers basic right‑of‑way and crossing geometry; extending it to weather conditions, pedestrian demographics, or cultural crossing habits could further boost accuracy.
  • Explainability: Although the model mimics human reasoning, providing transparent justification for each prediction (e.g., “visible traffic signal is red”) is an open challenge for safety certification.

PedX‑LLM illustrates how coupling visual perception with structured domain knowledge can turn a generic LLM into a robust, generalizable reasoning engine for safety‑critical tasks—an approach that could be replicated across many other urban AI applications.

Authors

  • Qingwen Pu
  • Kun Xie
  • Hong Yang
  • Guocong Zhai

Paper Information

  • arXiv ID: 2601.00694v1
  • Categories: cs.AI
  • Published: January 2, 2026