[Paper] ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

Published: February 16, 2026 at 01:16 PM EST
4 min read
Source: arXiv

Overview

ThermEval introduces the first large‑scale benchmark for testing how well vision‑language models (VLMs) understand thermal imagery—the kind of heat‑based pictures used in night‑time surveillance, search‑and‑rescue drones, autonomous vehicles, and medical screening. By exposing the blind spot of current RGB‑centric VLMs, the work pushes the community toward models that can reason about temperature, not just color.

Key Contributions

  • ThermEval‑B: ~55 k curated thermal visual‑question‑answer (VQA) pairs covering temperature‑grounded reasoning, object detection, and scene understanding.
  • ThermEval‑D: A novel dataset that provides dense per‑pixel temperature maps together with semantic body‑part annotations for indoor and outdoor scenes.
  • Comprehensive evaluation of 25 open‑source and commercial VLMs, revealing systematic failures on temperature‑related queries.
  • Analysis of failure modes: models rely on language priors, collapse under colormap changes, and show negligible improvement from prompting or fine‑tuning.
  • Open‑source benchmark suite (code, data, evaluation scripts) to enable reproducible research and future extensions.

Methodology

  1. Data Assembly – Public thermal image collections (e.g., FLIR‑ADAS, KAIST) were merged with the newly captured ThermEval‑D, which includes precise temperature readings for every pixel and manual body‑part labels.
  2. Question Generation – For each image, a mix of automatically generated and human‑written questions was created, targeting:
    • Temperature extraction (“What is the temperature of the car’s hood?”)
    • Relative heat reasoning (“Is the person in front hotter than the dog?”)
    • Cross‑modal inference (“Which regions would still be visible in a visible‑light image taken in the dark?”)
  3. Benchmark Structure – Questions are grouped into 7 primitive skill categories (e.g., “absolute temperature”, “heat gradient”, “thermal occlusion”) to diagnose specific reasoning gaps.
  4. Model Evaluation – Each VLM receives the thermal image (either raw 16‑bit data or a false‑color colormap) plus the question. Answers are compared against ground‑truth using exact‑match and soft‑BLEU metrics. Prompt engineering and a lightweight supervised fine‑tune (≤ 5 k examples) are also tested.

The pipeline is deliberately kept simple so that developers can plug in any VLM without needing specialized thermal preprocessing.
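The evaluation loop described above can be sketched in a few lines. This is an illustrative reduction, not the released harness: `toy_model`, `normalize`, and the sample triples are hypothetical stand‑ins, and the soft‑BLEU half of the scoring is omitted.

```python
def normalize(text: str) -> str:
    """Lowercase and strip whitespace so trivially different answers still match."""
    return text.strip().lower()

def exact_match(prediction: str, ground_truth: str) -> float:
    """Return 1.0 for a normalized exact match, else 0.0."""
    return float(normalize(prediction) == normalize(ground_truth))

def evaluate(model, samples):
    """Average exact-match score (0-100) of a VLM over (image, question, answer) triples."""
    scores = [exact_match(model(image, question), answer)
              for image, question, answer in samples]
    return 100.0 * sum(scores) / len(scores)

# Toy stand-in model that always answers "warm" -- the language-prior
# failure mode the paper reports.
toy_model = lambda image, question: "warm"
samples = [(None, "Is the hood hot or cold?", "hot"),
           (None, "Is the pavement warm or cold?", "warm")]
print(evaluate(toy_model, samples))  # 50.0
```

Because the model callable only needs to map an (image, question) pair to a string, any VLM wrapper can be dropped in without thermal-specific preprocessing, which is the point of the simple design.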

Results & Findings

| Model family | Raw thermal input | Colormap input | Avg. accuracy (out of 100) |
| --- | --- | --- | --- |
| Open‑source CLIP‑based VLMs | 22 | 15 | 18 |
| Proprietary GPT‑4‑V (vision) | 31 | 24 | 27 |
| Fine‑tuned on 5 k ThermEval examples | 35 | 28 | 31 |

  • Temperature grounding is near‑random: even the best‑performing model answers correctly on only ~30 % of absolute temperature questions.
  • Colormap brittleness: converting raw heat values to false‑color images drops performance by ~20 % across the board.
  • Language bias: when temperature information is ambiguous, models default to high‑frequency answers (“warm”, “hot”) regardless of the image.
  • Prompting helps little: adding “Answer in degrees Celsius” improves accuracy by < 3 % on average.
  • Fine‑tuning yields marginal gains: 5 k supervised examples raise scores by ~5 pts, indicating that the gap is not just data scarcity but a fundamental architectural mismatch.
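The colormap brittleness above has a simple mechanical component: false‑color rendering typically quantizes 16‑bit counts into at most 256 color entries before the model ever sees the image. A minimal illustration (the normalization bounds `lo`/`hi` are arbitrary assumptions, not values from the paper):

```python
import numpy as np

# Three distinct 16-bit thermal counts.
raw = np.array([30000, 30001, 30100], dtype=np.uint16)

# Typical false-color rendering: min-max normalize, then index a 256-entry
# colormap. The quantization step discards sub-bin temperature differences.
lo, hi = 29000, 31000
indices = ((raw.astype(np.float64) - lo) / (hi - lo) * 255).astype(np.uint8)
print(indices)  # first two counts collapse to the same color index
```

Once two temperatures map to the same color index, no downstream model can tell them apart, regardless of prompting or fine‑tuning.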

Practical Implications

  • Safety‑critical systems (autonomous cars, UAVs) cannot rely on off‑the‑shelf VLMs for thermal perception; dedicated thermal modules or multimodal adapters are needed.
  • Rapid prototyping: the benchmark’s modular design lets developers test custom temperature‑aware heads or sensor‑fusion pipelines without building a full dataset from scratch.
  • Edge deployment: since raw 16‑bit thermal data is more informative than colormapped versions, pipelines should preserve the original temperature channel rather than converting to RGB for inference.
  • Regulatory compliance: in medical screening (e.g., fever detection), models must demonstrate temperature‑grounded reasoning; ThermEval provides a concrete validation suite.
  • Research direction: the findings motivate new architectures that treat temperature as a physical scalar field (e.g., incorporating physics‑informed layers or contrastive temperature embeddings).
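Keeping the original temperature channel, as the edge‑deployment point above recommends, usually means converting raw counts to physical units rather than to RGB. A minimal sketch, assuming the common centikelvin convention (count = kelvin × 100) used by many radiometric formats; the paper does not specify ThermEval‑D’s storage format, so check your sensor’s calibration before relying on this:

```python
import numpy as np

def counts_to_celsius(raw: np.ndarray) -> np.ndarray:
    """Convert 16-bit radiometric counts to degrees Celsius.

    Assumes the centikelvin convention (count = kelvin * 100); this is an
    assumption about the sensor, not a universal standard.
    """
    return raw.astype(np.float64) / 100.0 - 273.15

# Simulated 16-bit frame: 300.00 K background with a 310.00 K hot spot.
frame = np.full((4, 4), 30000, dtype=np.uint16)
frame[1, 1] = 31000
celsius = counts_to_celsius(frame)
print(round(float(celsius[1, 1]), 2))  # 36.85
```

A pipeline that carries this float array (instead of a colormapped PNG) preserves the per‑pixel temperatures that ThermEval’s temperature‑grounded questions require.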

Limitations & Future Work

  • Domain coverage: while ThermEval‑D spans indoor/outdoor scenes, it lacks extreme environments (e.g., wildfire, industrial furnaces) where temperature ranges exceed current sensor limits.
  • Annotation granularity: body‑part temperature labels are coarse (pixel‑level averages) and may miss fine‑grained vascular patterns important for medical use.
  • Model diversity: the study focused on publicly available VLMs; proprietary models with internal thermal pretraining could behave differently.
  • Future extensions proposed by the authors:
    1. Adding video‑based thermal reasoning tasks.
    2. Expanding to multimodal fusion with LiDAR or radar.
    3. Exploring self‑supervised pretraining on raw thermal streams to close the performance gap.

Authors

  • Ayush Shrivastava
  • Kirtan Gangani
  • Laksh Jain
  • Mayank Goel
  • Nipun Batra

Paper Information

  • arXiv ID: 2602.14989v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: February 16, 2026