[Paper] Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

Published: April 23, 2026 at 01:44 PM EDT

Source: arXiv - 2604.21893v1

Overview

This paper investigates how to extract geographic insight from publicly available data to improve motor‑insurance claim‑frequency models, even when the underlying actuarial dataset contains only coarse location tags (e.g., postcode zones). By blending traditional actuarial variables with environmental descriptors from OpenStreetMap, CORINE land‑cover maps, and ortho‑imagery, the authors show that a richer representation of geography can improve predictive performance across a range of classic and modern machine‑learning models.

Key Contributions

Zone‑level framework: Demonstrates a practical way to work with limited spatial granularity (postcode zones) instead of exact addresses.
Multi‑source geographic features: Extracts and evaluates three geographic signal channels – raw coordinates, engineered environmental indicators, and deep‑learned image embeddings.
Model‑agnostic evaluation: Benchmarks the impact of geographic augmentation on GLMs, regularized GLMs (ridge/lasso), gradient‑boosted trees, and pure CNNs trained on raw imagery.
Scale analysis: Finds that a 5 km neighbourhood radius for environmental features yields the biggest accuracy lift, while finer (≤1 km) neighbourhoods still add value.
Vision‑transformer insight: Shows that pretrained vision‑transformer embeddings can rescue performance for linear models when handcrafted environmental data are unavailable.
Open‑science reproducibility: Uses the publicly released BeMTPL97 Belgian motor‑insurance dataset and openly accessible GIS layers, encouraging further research and industry pilots.

Methodology

Data preparation
- Actuarial core: Policy‑level risk factors (vehicle age, driver age, exposure, etc.) from the BeMTPL97 dataset.
- Geographic enrichment:
  - Coordinates: Latitude/longitude of each postcode centroid.
  - Environmental features: Counts/percentages of road types, land‑cover classes, points of interest, etc., aggregated within circular buffers of varying radii (0.5 km, 1 km, 5 km).
  - Ortho‑imagery: 256 × 256 px orthophotos (RGB) covering each zone, processed with a pretrained Vision Transformer (ViT) to obtain dense embeddings.
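The buffer aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes points of interest are available as latitude/longitude pairs and uses the haversine great-circle distance, whereas a production GIS workflow would more likely use projected coordinates and a spatial index. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def buffer_counts(centroid, points, radii_km=(0.5, 1.0, 5.0)):
    """Count the points that fall inside each circular buffer around a zone centroid."""
    lat0, lon0 = centroid
    dists = [haversine_km(lat0, lon0, lat, lon) for lat, lon in points]
    return {r: sum(d <= r for d in dists) for r in radii_km}


# Hypothetical centroid and points of interest (illustrative coordinates only).
centroid = (50.8503, 4.3517)
pois = [
    (50.8503, 4.3517),  # at the centroid
    (50.8530, 4.3517),  # about 0.3 km north
    (50.8575, 4.3517),  # about 0.8 km north
    (50.8800, 4.3517),  # about 3.3 km north
    (50.9500, 4.3517),  # about 11 km north
]
counts = buffer_counts(centroid, pois)
print(counts)
```

The same pattern extends to percentages (for example, the share of a land‑cover class) by dividing a per‑class count or area by the buffer total.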
Model families
- GLM (Poisson): Classic actuarial baseline.
- Regularized GLM: Ridge/Lasso to handle high‑dimensional feature sets.
- Gradient‑Boosted Trees (XGBoost/LightGBM): Captures non‑linear interactions without heavy feature engineering.
- CNN: Directly ingests raw images for an end‑to‑end vision baseline.
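To make the GLM baseline concrete, here is a small sketch of a Poisson claim‑frequency model with a log link. It assumes exposure enters as an offset, which is standard actuarial practice but is not confirmed by this summary, and it uses a single hypothetical binary feature so that the Newton‑Raphson update fits in a few lines of plain Python. In practice a statistics library would be used instead.

```python
import math


def fit_poisson_glm(x, claims, exposure, n_iter=25):
    """Fit log(E[claims]) = log(exposure) + b0 + b1 * x by Newton-Raphson."""
    # Start from the portfolio-wide claim frequency with no feature effect.
    b0, b1 = math.log(sum(claims) / sum(exposure)), 0.0
    for _ in range(n_iter):
        mu = [e * math.exp(b0 + b1 * xi) for xi, e in zip(x, exposure)]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(y - m for y, m in zip(claims, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(claims, mu, x))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(m * xi for m, xi in zip(mu, x))
        h11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical toy portfolio: x = 1 marks policies in a (made-up) dense-road zone.
x = [0, 0, 0, 1, 1, 1]
claims = [0, 1, 0, 1, 1, 0]
exposure = [1.0, 0.5, 1.0, 1.0, 1.0, 0.5]  # policy-years

b0, b1 = fit_poisson_glm(x, claims, exposure)
print(f"frequency when x=0: {math.exp(b0):.3f}")
print(f"frequency when x=1: {math.exp(b0 + b1):.3f}")
```

With one binary feature the fitted frequencies equal the observed claims divided by exposure in each group, which gives a simple sanity check on the fit.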
Training & evaluation
- Split data by postcodes: train on a set of zones, test on unseen zones to mimic real‑world deployment where new geographic areas appear.
- Metrics: Mean Absolute Error (MAE) and Poisson deviance on claim‑frequency predictions.
- Ablation studies: add each geographic channel separately and in combination to isolate their marginal contribution.
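The evaluation protocol above can be sketched in plain Python. The split keeps every postcode zone entirely in either the training or the test set, and the two metrics follow their textbook definitions. The 20% test fraction, the seed, and the function names are assumptions made for illustration; the summary does not state the values the authors used.

```python
import math
import random


def split_by_zone(zones, test_fraction=0.2, seed=0):
    """Assign whole postcode zones to train or test so no zone appears in both."""
    unique = sorted(set(zones))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_zones = set(unique[:n_test])
    train_idx = [i for i, z in enumerate(zones) if z not in test_zones]
    test_idx = [i for i, z in enumerate(zones) if z in test_zones]
    return train_idx, test_idx


def mean_absolute_error(y, mu):
    """Average absolute gap between observed and predicted claim counts."""
    return sum(abs(a - b) for a, b in zip(y, mu)) / len(y)


def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance; the y * log(y / mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total / len(y)


# Hypothetical postcode labels, one per policy.
zones = ["1000", "1000", "2000", "3000", "3000", "4000", "5000", "5000"]
train_idx, test_idx = split_by_zone(zones)
train_zones = {zones[i] for i in train_idx}
held_out_zones = {zones[i] for i in test_idx}
print(sorted(train_zones), sorted(held_out_zones))
print(mean_absolute_error([1, 0], [1.0, 0.5]), mean_poisson_deviance([1, 0], [1.0, 0.5]))
```

Splitting on zones rather than on individual policies prevents a zone's geographic features from appearing on both sides of the split, which would otherwise inflate test performance.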

Results & Findings

Model	Baseline MAE (actuarial only)	ΔMAE, + coordinates	ΔMAE, + env. features (5 km)	ΔMAE, + image embeddings*
GLM	0.112	–0.004	–0.009	–0.003
Regularized GLM	0.108	–0.003	–0.011	–0.015 (when env. missing)
Gradient‑Boosted Trees	0.099	–0.006	–0.014	–0.005
CNN (raw images)	0.105	n/a	n/a	–0.008

*Image embeddings improve regularized GLM only when environmental descriptors are omitted; otherwise they add little extra signal.

Take‑aways

Adding coordinates alone yields modest gains; the real boost comes from environmental aggregates at a 5 km scale.
Tree‑based models gain the most from the geographic signal: the 5 km environmental features cut their MAE by 0.014 from a baseline of 0.099, roughly 14 %.
Linear models can still benefit from vision‑transformer embeddings, offering a lightweight way to inject visual context without training a full CNN.
The uplift from environmental features (0.009 to 0.014 MAE) is comparable to the baseline gap between the plain GLM and gradient‑boosted trees (0.013 MAE), underscoring that how geography is represented matters at least as much as which algorithm you use.
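The relative reductions can be checked directly from the baseline MAE values and the 5 km environmental‑feature deltas reported in the results table. The snippet below is only arithmetic on those published numbers; the variable names are illustrative.

```python
# Baseline MAE and the change in MAE after adding 5 km environmental features,
# copied from the results table in this summary.
baseline_mae = {
    "GLM": 0.112,
    "Regularized GLM": 0.108,
    "Gradient-Boosted Trees": 0.099,
}
env_delta = {
    "GLM": -0.009,
    "Regularized GLM": -0.011,
    "Gradient-Boosted Trees": -0.014,
}

# Relative reduction in percent: size of the improvement over the baseline error.
relative_reduction = {
    model: round(100 * -env_delta[model] / baseline_mae[model], 1)
    for model in baseline_mae
}

for model, pct in relative_reduction.items():
    print(f"{model}: {pct}% lower MAE")
```

For gradient‑boosted trees this gives 0.014 / 0.099, or about 14.1%, against roughly 8.0% for the plain GLM and 10.2% for the regularized GLM.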

Practical Implications

InsurTech product teams can enrich existing underwriting pipelines with cheap GIS data (OSM, CORINE) without needing precise address data, preserving privacy while still gaining geographic insight.
Risk‑based pricing: More accurate zone‑level frequency forecasts enable finer granularity in premium adjustments, potentially reducing adverse selection.
Rapid prototyping: Developers can start with a regularized GLM plus pretrained ViT embeddings as a low‑compute baseline, then iterate to gradient‑boosted trees for maximum performance.
Regulatory compliance: Since the approach works on aggregated zones, it sidesteps many data‑privacy constraints that plague address‑level modeling.
Scalable deployment: Feature extraction (buffer counts, land‑cover percentages) can be pre‑computed in a data lake and refreshed periodically, making the solution production‑ready for large portfolios.
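As a rough sketch of the pre‑computation idea, zone features can be stored in a lookup table keyed by postcode and joined onto policies at scoring time. Everything here is hypothetical: the field names, the feature name `road_km_5km`, and the choice to flag zones with no pre‑computed features rather than drop them.

```python
def attach_zone_features(policies, zone_features):
    """Join pre-computed zone-level features onto policy records by postcode."""
    enriched = []
    for policy in policies:
        feats = zone_features.get(policy["postcode"])
        row = dict(policy)
        # Flag zones that have no pre-computed features instead of dropping the policy.
        row["geo_missing"] = feats is None
        row.update(feats or {})
        enriched.append(row)
    return enriched


# Hypothetical pre-computed table, refreshed periodically by a batch job.
zone_features = {"1000": {"road_km_5km": 42.0}}
policies = [
    {"policy_id": 1, "postcode": "1000"},
    {"policy_id": 2, "postcode": "9999"},  # zone not yet in the feature table
]
enriched = attach_zone_features(policies, zone_features)
print(enriched)
```

Keeping the join separate from feature extraction means the expensive GIS work runs once per zone and refresh cycle, not once per policy.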

Limitations & Future Work

Geographic granularity: The study is limited to postcode zones; performance on finer (e.g., street‑level) or coarser (regional) aggregations remains unknown.
Domain specificity: Results are based on Belgian MTPL data; transferability to other countries with different road networks or claim cultures needs validation.
Image quality & coverage: Ortho‑imagery was limited to publicly released tiles; higher‑resolution or multispectral imagery could further improve visual embeddings.
Temporal dynamics: The models ignore how geographic risk evolves over time (e.g., new construction), an avenue for incorporating time‑aware GIS layers.
Explainability: While tree‑based models can provide feature importance, the black‑box nature of vision‑transformer embeddings makes it harder to interpret why a particular zone is riskier. Future work could explore attention‑map visualizations or hybrid models that retain interpretability.

Authors

Sherly Alfonso‑Sánchez
Cristián Bravo
Kristina G. Stankova

Paper Information

arXiv ID: 2604.21893v1
Categories: stat.ML, cs.LG, q-fin.RM
Published: April 23, 2026
