[Paper] Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting

Published: November 28, 2025 at 12:27 PM EST
Source: arXiv - 2511.23387v1

Overview

The Hierarchical AI‑Meteorologist paper introduces an LLM‑agent system that turns raw weather data into clear, explainable forecast reports. The system reasons at multiple time scales (hourly, 6‑hourly, daily) and extracts concise “weather keywords,” so the resulting narratives are both human‑readable and machine‑verifiable. This targets the gap between data‑driven forecasting models and trustworthy weather reporting.

Key Contributions

  • Hierarchical reasoning framework that fuses short‑term and long‑term meteorological signals before generating text.
  • Dual‑output LLM agent: simultaneously creates a natural‑language forecast and a short list of semantic keywords summarizing dominant weather events.
  • Keyword‑anchored validation: uses the extracted keywords to check temporal coherence, factual consistency, and overall plausibility of the generated report.
  • Open‑source reproducible pipeline built on publicly available OpenWeather and Meteostat datasets, enabling other researchers and developers to replicate and extend the approach.
  • Demonstrated improvement in interpretability and robustness compared with flat, single‑scale LLM forecasting baselines.
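
To make the dual‑output idea concrete, here is a minimal sketch of how a reply carrying both a narrative and a keyword list might be parsed and sanity‑checked. The JSON schema (`narrative`, `keywords`), the `ForecastReport` class, and `parse_report` are assumptions made for illustration; the summary above does not specify the paper's actual output format.

```python
import json
from dataclasses import dataclass


@dataclass
class ForecastReport:
    """Hypothetical container for the agent's two outputs."""
    narrative: str
    keywords: list[str]


def parse_report(raw: str, min_kw: int = 3, max_kw: int = 5) -> ForecastReport:
    """Parse a JSON reply with 'narrative' and 'keywords' fields (assumed schema)."""
    data = json.loads(raw)
    narrative = str(data["narrative"]).strip()
    # Normalize keywords: lowercase, hyphen-separated, de-duplicated, order kept.
    seen: dict[str, None] = {}
    for kw in data["keywords"]:
        seen.setdefault(str(kw).strip().lower().replace(" ", "-"), None)
    keywords = [k for k in seen if k]
    if not narrative:
        raise ValueError("empty narrative")
    if not (min_kw <= len(keywords) <= max_kw):
        raise ValueError(f"expected {min_kw}-{max_kw} keywords, got {len(keywords)}")
    return ForecastReport(narrative, keywords)


# Stand-in for a model reply; no real LLM is called here.
example_reply = json.dumps({
    "narrative": "A cold front brings heavy rain and gusty winds on Tuesday.",
    "keywords": ["Cold Front", "heavy-rain", "gusty winds", "heavy rain"],
})
report = parse_report(example_reply)
print(report.keywords)  # ['cold-front', 'heavy-rain', 'gusty-winds']
```

Normalizing the keywords up front keeps any later comparison against structured data a simple exact‑match lookup.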

Methodology

  1. Data Ingestion – Raw observations (temperature, wind, precipitation, etc.) are pulled from OpenWeather and Meteostat APIs and pre‑processed into structured time‑series tables at three granularities: hourly, 6‑hourly, and daily.
  2. Hierarchical Context Construction – The three granularities are fed into a lightweight transformer encoder that learns cross‑scale relationships (e.g., a sudden temperature dip in the hourly slice that aligns with a larger cold‑front trend in the daily slice).
  3. LLM‑Agent Prompting – The encoded context is inserted into a prompt for a large language model (e.g., GPT‑4‑style). The prompt explicitly asks the model to:
    • Write a concise weather narrative for the target region and period.
    • Output 3‑5 “weather keywords” that capture the most salient phenomena (e.g., cold‑front, heavy‑rain, gusty‑winds).
  4. Keyword‑Based Consistency Checks – After generation, a lightweight rule‑based verifier cross‑references the keywords with the original structured data. If mismatches are detected (e.g., a keyword “snow” but no snowfall in the data), the system can request a regeneration or flag the report for human review.
  5. Evaluation – The authors compare the hierarchical system against a flat baseline (single‑scale LLM) using both automatic metrics (BLEU, ROUGE) and human expert ratings for clarity, factuality, and usefulness.
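
Steps 1 to 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the input rows are synthetic rather than fetched from OpenWeather or Meteostat, the field names (`temp_c`, `wind_ms`, `precip_mm`) are hypothetical, and a plain‑text serialization stands in for the transformer encoder described in step 2.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def aggregate(hourly, hours_per_bucket):
    """Group hourly rows into fixed-width buckets aligned to midnight."""
    buckets = defaultdict(list)
    for row in hourly:
        ts = row["time"]
        start_hour = (ts.hour // hours_per_bucket) * hours_per_bucket
        start = ts.replace(hour=start_hour, minute=0, second=0, microsecond=0)
        buckets[start].append(row)
    out = []
    for start in sorted(buckets):
        rows = buckets[start]
        out.append({
            "time": start,
            "temp_mean_c": round(mean(r["temp_c"] for r in rows), 1),
            "wind_max_ms": max(r["wind_ms"] for r in rows),
            "precip_sum_mm": round(sum(r["precip_mm"] for r in rows), 1),
        })
    return out


def build_context(tables, last_n=3):
    """Serialize the most recent rows of each scale into labeled text sections."""
    sections = []
    for name, rows in tables.items():
        lines = [
            f"{r['time']:%Y-%m-%d %H:%M} temp_mean={r['temp_mean_c']}C "
            f"wind_max={r['wind_max_ms']}m/s precip={r['precip_sum_mm']}mm"
            for r in rows[-last_n:]
        ]
        sections.append(f"[{name}]\n" + "\n".join(lines))
    return "\n\n".join(sections)


# Synthetic 48-hour series: steady cooling, six hours of rain on day one.
t0 = datetime(2025, 1, 1)
hourly = [
    {"time": t0 + timedelta(hours=h), "temp_c": 10.0 - 0.25 * h,
     "wind_ms": 5.0 + (h % 6), "precip_mm": 1.0 if 12 <= h < 18 else 0.0}
    for h in range(48)
]
tables = {
    "hourly": aggregate(hourly, 1),
    "6-hourly": aggregate(hourly, 6),
    "daily": aggregate(hourly, 24),
}
context = build_context(tables)
prompt = (
    "Write a concise weather narrative for the period below, then list "
    "3-5 weather keywords.\n\n" + context
)
print(prompt)
```

Keeping all three scales in one prompt lets the model relate a short‑lived hourly event to the daily trend it belongs to, which is the point of the hierarchical context.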

Results & Findings

Metric                                     Hierarchical AI‑Meteorologist  Flat LLM Baseline
BLEU (forecast text)                       0.42                           0.31
ROUGE‑L (summary quality)                  0.58                           0.44
Keyword‑Data Alignment                     93 % correct                   71 % correct
Human expert rating (1‑5) – Clarity        4.6                            3.8
Human expert rating – Factual consistency  4.7                            3.9

  • The hierarchical model consistently produced more accurate and coherent narratives, especially for multi‑day forecasts where trend aggregation matters.
  • Keyword extraction proved a reliable “semantic anchor”: keyword‑data alignment rose from 71 % to 93 % correct, and the verification step caught 87 % of factual errors before they reached the end user.
  • Qualitative feedback highlighted that developers found the keyword list useful for downstream automation (e.g., triggering alerts or populating UI widgets).
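
A rule‑based keyword check of the kind credited with catching factual errors might look like the sketch below. The rule table, the thresholds, the field names, and the `accept` / `regenerate` / `human_review` outcomes are all illustrative assumptions; the summary does not give the paper's actual rules.

```python
# Hypothetical rules: each keyword maps to a predicate over daily aggregates.
# Thresholds are illustrative, not meteorological standards.
RULES = {
    "heavy-rain": lambda d: d["precip_sum_mm"] >= 20.0,
    "snow": lambda d: d["snow_sum_mm"] > 0.0,
    "gusty-winds": lambda d: d["wind_max_ms"] >= 15.0,
    "cold-front": lambda d: d["temp_drop_c"] >= 5.0,
}


def verify_keywords(keywords, daily):
    """Return (unsupported, unknown) keywords for one day's structured data."""
    unsupported, unknown = [], []
    for kw in keywords:
        rule = RULES.get(kw)
        if rule is None:
            unknown.append(kw)       # no rule exists, so it cannot be checked
        elif not rule(daily):
            unsupported.append(kw)   # the data contradicts the keyword
    return unsupported, unknown


def decide(keywords, daily, attempt, max_attempts=2):
    """Accept the report, ask for a regeneration, or escalate to a human."""
    unsupported, unknown = verify_keywords(keywords, daily)
    if not unsupported and not unknown:
        return "accept"
    if unsupported and attempt < max_attempts:
        return "regenerate"
    return "human_review"


daily = {"precip_sum_mm": 32.5, "snow_sum_mm": 0.0,
         "wind_max_ms": 17.0, "temp_drop_c": 6.5}
keywords = ["cold-front", "heavy-rain", "snow"]

print(verify_keywords(keywords, daily))   # (['snow'], [])
print(decide(keywords, daily, attempt=1))  # regenerate
print(decide(keywords, daily, attempt=2))  # human_review
```

Here the keyword “snow” is rejected because the data records no snowfall, so the first attempt triggers a regeneration and a repeated failure is escalated for review.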

Practical Implications

  • Automated Weather Services – Companies that provide weather APIs can embed the hierarchical agent to generate ready‑to‑publish text, reducing manual editorial effort.
  • Alert & Notification Systems – The concise keyword set can feed directly into rule‑based alert pipelines (e.g., “if heavy‑rain appears, send flood warning”).
  • Localization & Accessibility – Because the LLM produces natural language, the same pipeline can be re‑prompted for different languages or simplified summaries for non‑technical audiences.
  • Explainable AI Audits – The keyword‑anchored validation offers a transparent audit trail, satisfying regulatory or compliance requirements for AI‑generated content.
  • Edge Deployment – The hierarchical encoder is lightweight enough to run on edge servers close to data sources, enabling near‑real‑time forecast generation for IoT devices (smart agriculture, autonomous drones, etc.).
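
As an illustration of the alerting use case, the following sketch maps keywords to alerts with a simple lookup. Only the heavy‑rain to flood‑warning pairing comes from the text above; the other rules and the `alerts_for` and `dispatch` helpers are hypothetical.

```python
# Hypothetical keyword-to-alert table; only the first rule is from the article.
ALERT_RULES = {
    "heavy-rain": "flood warning",
    "gusty-winds": "wind advisory",
    "snow": "snow advisory",
}


def alerts_for(keywords):
    """Return the alerts triggered by a keyword list, without duplicates."""
    alerts = []
    for kw in keywords:
        alert = ALERT_RULES.get(kw)
        if alert is not None and alert not in alerts:
            alerts.append(alert)
    return alerts


def dispatch(keywords, send):
    """Pass each triggered alert to a caller-supplied send function."""
    for alert in alerts_for(keywords):
        send(alert)


sent = []
dispatch(["cold-front", "heavy-rain", "gusty-winds"], sent.append)
print(sent)  # ['flood warning', 'wind advisory']
```

Because the keywords are a short, closed vocabulary, the alert logic stays a plain lookup and never has to parse the free‑text narrative.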

Limitations & Future Work

  • Model Dependency – The quality hinges on the underlying LLM; smaller or open‑source models may not match the reported performance without fine‑tuning.
  • Geographic Scope – Experiments focused on mid‑latitude regions with dense observation networks; performance in data‑sparse areas (e.g., oceans, remote polar zones) remains untested.
  • Keyword Granularity – Fixed‑size keyword lists may miss nuanced phenomena; future work could explore hierarchical keyword trees or dynamic length selection.
  • Real‑Time Constraints – While the encoder is efficient, the full LLM inference can still be latency‑heavy for ultra‑low‑latency applications; model distillation or caching strategies are suggested next steps.
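
One way the caching idea could be approached is to key generated reports on a hash of the structured input, so identical data never triggers a second LLM call. This is only a sketch: `ReportCache`, `fake_llm`, and the context fields are hypothetical, and the summary does not describe a specific caching design.

```python
import hashlib
import json


class ReportCache:
    """Cache generated reports by a hash of their structured input."""

    def __init__(self, generate):
        self._generate = generate  # callable: context dict -> report text
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(context):
        # Canonical JSON makes the key independent of dict key order.
        canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, context):
        key = self._key(context)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._generate(context)
        return self._store[key]


calls = []


def fake_llm(context):
    """Stand-in for an expensive LLM call; records each invocation."""
    calls.append(context)
    return f"Forecast for {context['region']}: max {context['temp_max_c']} C."


cache = ReportCache(fake_llm)
first = cache.get({"region": "test-region", "temp_max_c": 4.0})
second = cache.get({"temp_max_c": 4.0, "region": "test-region"})
print(cache.hits, cache.misses, len(calls))  # 1 1 1
```

A cache like this only helps when many requests share the same input window, for example several clients asking for the same region between data refreshes.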

Overall, the Hierarchical AI‑Meteorologist shows a promising path toward trustworthy, explainable AI‑driven weather reporting, bridging the gap between raw meteorological data and developer‑friendly, actionable insights.

Authors

  • Daniil Sukhorukov
  • Andrei Zakharov
  • Nikita Glazkov
  • Katsiaryna Yanchanka
  • Vladimir Kirilin
  • Maxim Dubovitsky
  • Roman Sultimov
  • Yuri Maksimov
  • Ilya Makarov

Paper Information

  • arXiv ID: 2511.23387v1
  • Categories: cs.AI
  • Published: November 28, 2025