[Paper] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Published: December 31, 2025 at 12:31 PM EST
4 min read
Source: arXiv - 2512.24985v1

Overview

The paper DarkEQA introduces the first benchmark that tests how well vision‑language models (VLMs) can answer questions about what they "see" in low‑light indoor scenes. By simulating realistic night‑time illumination and sensor noise, the authors expose a perception bottleneck that most existing embodied‑AI evaluations overlook.

Key Contributions

  • DarkEQA benchmark – a publicly released dataset that pairs egocentric video frames with question‑answer pairs across multiple, precisely calibrated low‑light levels.
  • Physically‑based degradation pipeline – renders low‑light images in linear RAW space, applies an illumination drop and realistic sensor noise, then runs ISP‑style tone mapping to mimic real camera output (see the sketch after this list).
  • Systematic evaluation of dozens of state‑of‑the‑art VLMs (e.g., CLIP‑V, BLIP‑2, LLaVA) and low‑light image enhancement (LLIE) models on the same benchmark.
  • Attributable robustness analysis – isolates perception from reasoning, showing how much performance loss is due to visual quality versus model architecture.
  • Open‑source release – code, data, and evaluation scripts will be made available, enabling the community to extend the benchmark to new models or environments.
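
To make the degradation pipeline concrete, the snippet below is a minimal sketch (Python/NumPy, not the authors' released code) of how linear‑RAW dimming, sensor noise, and a simplified tone map might be composed. All parameter values (illum_scale, full_well, read_noise_std, gamma) are illustrative assumptions, and the real ISP stage also includes demosaicing and white balance.

```python
import numpy as np

def degrade_raw(raw, illum_scale=0.02, full_well=1000.0,
                read_noise_std=0.01, gamma=2.2, rng=None):
    """Hypothetical low-light degradation applied in linear RAW space.

    raw: HxWx3 linear-intensity image with values in [0, 1].
    illum_scale: fraction of the original illumination (0.02 = 2 %).
    """
    rng = rng if rng is not None else np.random.default_rng()
    # 1. Reduce illumination in linear space.
    dim = raw * illum_scale
    # 2. Photon shot noise: Poisson statistics on the scaled photon counts.
    photons = rng.poisson(dim * full_well) / full_well
    # 3. Additive read-out noise from the sensor electronics.
    noisy = photons + rng.normal(0.0, read_noise_std, size=raw.shape)
    # 4. Simplified ISP-style tone mapping (demosaicing and white balance
    #    omitted here): clip to the valid range and apply gamma correction.
    return np.clip(noisy, 0.0, 1.0) ** (1.0 / gamma)
```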

Methodology

  1. Environment & Data Generation – The authors start from existing embodied‑question‑answering (EQA) scenes (e.g., AI2‑THOR rooms) and capture raw sensor data from a simulated camera.
  2. Low‑Light Simulation – Light intensity is reduced in linear RAW space (to between 0 % and 5 % of the original illumination), and realistic photon shot noise plus read‑out noise are added.
  3. ISP Rendering – The noisy RAW images are processed through a simplified image‑signal‑processor pipeline (demosaicing, white‑balance, gamma correction) to produce the final RGB frames that a VLM would actually see.
  4. Benchmark Construction – For each lighting level, a set of navigation‑free egocentric frames is paired with natural‑language questions (e.g., “What color is the lamp on the table?”). The ground‑truth answers are derived from the simulator’s object metadata.
  5. Evaluation Protocol – VLMs receive the degraded frames and the question, then generate an answer. Accuracy is measured with exact‑match and fuzzy‑match metrics. LLIE models are optionally inserted as a pre‑processor to test whether enhancement helps VLM performance (a minimal scoring sketch follows this list).
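
As a rough illustration of this protocol (not the paper's actual evaluation code), the sketch below scores a model over (frame, question, answer) triples with exact‑match and fuzzy‑match accuracy and an optional LLIE pre‑processing hook. The callables vlm_answer and enhance and the fuzzy threshold are assumed interfaces for illustration.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Lowercase and keep only alphanumerics/spaces so string comparison is fair.
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

def score_answers(examples, vlm_answer, enhance=None, fuzzy_threshold=0.8):
    """Score a VLM on (frame, question, ground-truth answer) triples.

    vlm_answer(frame, question) -> str   # model under test (assumed interface)
    enhance(frame) -> frame              # optional LLIE pre-processor
    """
    exact = fuzzy = 0
    for frame, question, gt in examples:
        if enhance is not None:
            frame = enhance(frame)                   # low-light enhancement step
        pred = normalize(vlm_answer(frame, question))
        gt_norm = normalize(gt)
        exact += int(pred == gt_norm)                # strict string equality
        similarity = SequenceMatcher(None, pred, gt_norm).ratio()
        fuzzy += int(similarity >= fuzzy_threshold)  # lenient similarity match
    n = max(len(examples), 1)
    return {"exact_match": exact / n, "fuzzy_match": fuzzy / n}
```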

Results & Findings

  • Sharp performance drop – Most VLMs lose 30–50 percentage points of accuracy when illumination falls below 2 % of the original level, even though the underlying scene layout stays unchanged.
  • LLIE helps, but not enough – Applying top‑ranked low‑light enhancement models (e.g., KinD, EnlightenGAN) recovers only ~10‑15 % of the lost accuracy, indicating that VLMs are still brittle to residual artifacts.
  • Model‑specific trends – Larger, instruction‑tuned VLMs (e.g., LLaVA‑13B) degrade more gracefully than smaller CLIP‑based models, suggesting that richer language priors can partially compensate for visual noise.
  • Perception vs. reasoning – When the same questions are answered using perfect (well‑lit) images, all models achieve >90 % accuracy, confirming that the bottleneck is primarily visual.
  • Cross‑lighting robustness – Training a VLM on a mixture of lighting conditions (data augmentation) improves low‑light performance by roughly 20 % relative, but still lags behind well‑lit performance (a minimal augmentation sketch follows this list).
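
The augmentation strategy could be implemented roughly as follows: each training frame is re‑rendered at a randomly chosen illumination level before being shown to the model. The level set and the degrade_fn hook (e.g., the RAW‑space pipeline sketched above) are assumptions for illustration.

```python
import random

def augment_lighting(frame, degrade_fn, levels=(1.0, 0.05, 0.02, 0.01)):
    """Randomly re-render a training frame at one of several illumination levels.

    degrade_fn(frame, illum_scale=...) should implement a low-light degradation,
    e.g. the RAW-space pipeline sketched earlier.
    """
    scale = random.choice(levels)
    # Keep the frame well-lit when scale == 1.0; otherwise dim and add noise.
    return frame if scale == 1.0 else degrade_fn(frame, illum_scale=scale)
```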

Practical Implications

  • Robotics & Home Assistants – Service robots that need to operate 24/7 (e.g., night‑time security patrols, bedside assistance) cannot rely on off‑the‑shelf VLMs without additional low‑light handling.
  • AR/VR & Wearables – Head‑mounted devices used in dim environments (e.g., warehouses, hospitals) will benefit from integrating LLIE front‑ends or training VLMs on DarkEQA‑style data.
  • Edge Deployment – The benchmark highlights that simply scaling model size isn’t enough; developers should consider sensor‑level improvements (larger apertures, infrared) or lightweight denoising modules that fit on edge hardware.
  • Evaluation Standards – DarkEQA provides a reproducible way to stress‑test any embodied AI pipeline before field deployment, encouraging more robust product releases.

Limitations & Future Work

  • Simulation‑only – The benchmark relies on synthetic RAW generation; real‑world low‑light capture may introduce additional complexities (e.g., motion blur, color casts).
  • Static lighting levels – Only a handful of discrete illumination levels are tested; continuous adaptation to dynamic lighting changes remains unexplored.
  • Focus on perception – While isolating perception is useful, future work should evaluate end‑to‑end navigation + QA under low light to capture interaction effects.
  • Broader modalities – Extending the benchmark to multimodal sensors (depth, infrared) could open new avenues for robust embodied reasoning.

Authors

  • Yohan Park
  • Hyunwoo Ha
  • Wonjun Jo
  • Tae‑Hyun Oh

Paper Information

  • arXiv ID: 2512.24985v1
  • Categories: cs.CV, cs.AI, cs.LG, cs.RO
  • Published: December 31, 2025