[Paper] STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning
Source: arXiv - 2601.03248v1
Overview
The paper presents STReasoner, a novel framework that equips large language models (LLMs) with the ability to reason over spatio‑temporal data—think traffic flows, power‑grid measurements, or epidemic curves—by jointly processing time‑series signals, graph‑structured spatial relationships, and natural‑language context. To evaluate this capability, the authors also release ST‑Bench, a benchmark covering four core reasoning tasks, and demonstrate that their approach dramatically outperforms existing methods while costing a fraction of the compute of proprietary models.
Key Contributions
- ST‑Bench: A publicly released benchmark with four spatio‑temporal reasoning tasks (etiological reasoning, entity identification, correlation reasoning, in‑context forecasting) generated via a stochastic differential equation (SDE) multi‑agent simulator.
- STReasoner architecture: A plug‑and‑play pipeline that fuses raw time‑series, graph adjacency information, and textual prompts into a unified LLM input format.
- S‑GRPO (Spatial‑Guided Reinforcement Policy Optimization): A reinforcement‑learning‑based training loop that explicitly rewards improvements attributable to spatial cues, encouraging the model to ground its logic in the underlying network topology.
- Efficiency gains: Achieves 17 %–135 % higher accuracy across benchmark tasks while using only 0.004× the inference cost of leading closed‑source LLMs.
- Real‑world validation: Shows robust transfer from synthetic ST‑Bench data to publicly available traffic and power‑grid datasets without additional fine‑tuning.
Methodology
- Data synthesis – The authors build a multi‑agent simulator where each agent follows a stochastic differential equation (SDE) that governs its temporal evolution. Agents are placed on a graph that encodes spatial connectivity (e.g., road network, transmission lines). By varying interaction parameters, the authors generate diverse scenarios for the four benchmark tasks (a simulator sketch follows this list).
- Input encoding – For each reasoning instance, three modalities are concatenated into a single prompt (a prompt‑builder sketch follows this list):
  - Time‑series snippets (e.g., recent sensor readings) are tokenized using a simple quantization scheme.
  - Graph context is expressed as edge‑list text (“Node A → Node B (weight = 0.8)”).
  - A natural‑language query describes the reasoning goal (e.g., “Which sensor is most likely to fail next?”).
- LLM backbone – A standard decoder‑only LLM (e.g., LLaMA‑7B) is used as the base model.
- Spatial‑aware RL (S‑GRPO) – After supervised pre‑training on the synthetic data, the model is fine‑tuned with a reinforcement‑learning loop (a reward sketch follows this list):
  - The reward is decomposed into a spatial component (how much the answer improves when spatial edges are present) and a task component (overall correctness).
  - Policy gradients push the model to generate answers that explicitly leverage spatial information, reducing reliance on spurious textual patterns.
- Evaluation – Accuracy, F1, and a new “spatial‑utilization score” (percentage of correct answers that change when spatial edges are shuffled) are reported across all tasks.
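To make the data‑synthesis step concrete, here is a minimal sketch of a graph‑coupled SDE simulator integrated with the Euler-Maruyama scheme. The linear neighbour‑coupling drift, noise level, and parameter names are illustrative assumptions, not the authors' exact agent dynamics.

```python
# Minimal sketch of a graph-coupled SDE simulator (Euler-Maruyama).
# The coupling form and all parameters are illustrative assumptions.
import numpy as np

def simulate(adj, T=200, dt=0.1, coupling=0.5, sigma=0.2, seed=0):
    """Simulate N agents whose states drift toward their graph neighbours.

    adj: (N, N) weighted adjacency matrix of the spatial graph.
    Returns an array of shape (T, N): one time series per agent.
    """
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    x = rng.normal(size=n)                          # random initial states
    deg = adj.sum(axis=1, keepdims=True) + 1e-8     # avoid division by zero
    traj = np.empty((T, n))
    for t in range(T):
        # Drift: each agent is pulled toward the weighted mean of its neighbours.
        neighbour_mean = (adj @ x[:, None] / deg).ravel()
        drift = coupling * (neighbour_mean - x)
        # Diffusion: independent Gaussian noise per agent, scaled by sqrt(dt).
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
        traj[t] = x
    return traj

# Example: a 3-node chain graph (A - B - C).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
series = simulate(adj)
print(series.shape)  # (200, 3)
```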
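The input‑encoding step can likewise be sketched as a small prompt builder that quantizes the series, renders the graph as edge‑list text, and appends the query. The bin count, token layout, and template wording are assumptions for illustration; the paper's exact scheme may differ.

```python
# Hypothetical prompt builder for the three-modality input format.
import numpy as np

def quantize_series(values, n_bins=10):
    """Map raw readings to coarse integer bins so they tokenize compactly."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    bins = np.clip((values - lo) / (hi - lo + 1e-8) * n_bins, 0, n_bins - 1)
    return " ".join(str(int(b)) for b in bins)

def encode_graph(edges):
    """Render weighted edges as text, e.g. 'Node A -> Node B (weight=0.8)'."""
    return "\n".join(f"Node {u} -> Node {v} (weight={w:.1f})" for u, v, w in edges)

def build_prompt(series_by_node, edges, question):
    series_txt = "\n".join(f"Sensor {k}: {quantize_series(v)}"
                           for k, v in series_by_node.items())
    return (
        "Time series (quantized):\n" + series_txt + "\n\n"
        "Spatial graph:\n" + encode_graph(edges) + "\n\n"
        "Question: " + question
    )

print(build_prompt(
    {"A": [3.1, 3.4, 4.0, 5.2], "B": [2.9, 3.0, 3.1, 3.0]},
    [("A", "B", 0.8)],
    "Which sensor is most likely to fail next?",
))
```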
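Finally, the S‑GRPO reward split might look roughly like the following, in the spirit of group‑relative policy optimization. Only the decomposition into a spatial and a task component comes from the paper; the weighting, scoring functions, and normalization details are assumptions.

```python
# Sketch of a spatially decomposed reward with GRPO-style group advantages.
# Scoring, weighting, and the counterfactual "no-graph" rollout are assumptions.
from dataclasses import dataclass

@dataclass
class Rollout:
    answer: str                 # model answer with the spatial edges in the prompt
    answer_without_graph: str   # answer to the same prompt with edges removed
    gold: str                   # reference answer

def task_reward(answer, gold):
    """Overall correctness; exact match here, task-specific scoring in general."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def spatial_reward(r):
    """Credit only the improvement attributable to the spatial edges."""
    return max(0.0, task_reward(r.answer, r.gold)
               - task_reward(r.answer_without_graph, r.gold))

def combined_reward(r, spatial_weight=0.5):
    return task_reward(r.answer, r.gold) + spatial_weight * spatial_reward(r)

def group_advantages(rollouts):
    """GRPO-style: normalize rewards within a group of rollouts for one prompt."""
    rewards = [combined_reward(r) for r in rollouts]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 + 1e-8
    return [(x - mean) / std for x in rewards]
```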
Results & Findings
| Task | Baseline accuracy (LLM‑only) | STReasoner accuracy (S‑GRPO) | Relative Gain |
|---|---|---|---|
| Etiological reasoning | 58 % | 84 % | +44 % |
| Entity identification | 62 % | 91 % | +47 % |
| Correlation reasoning | 55 % | 73 % | +33 % |
| In‑context forecasting | 61 % | 78 % | +28 % |
- Spatial‑utilization score jumps from ~12 % (baseline) to >70 % after S‑GRPO, confirming that the model is truly grounding its logic in the graph (a scoring sketch follows this list).
- Compute efficiency: Inference latency and GPU memory are ~0.4 % of what is required for comparable proprietary models (e.g., GPT‑4).
- Real‑world transfer: When tested on a city‑wide traffic dataset, STReasoner retains a 15 %–20 % accuracy edge over the baseline, despite being trained only on synthetic data.
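The spatial‑utilization score reported above can be approximated as follows: among instances the model answers correctly, count how many answers change once the spatial edges are perturbed. The edge‑shuffling strategy and the model interface below are illustrative assumptions.

```python
# Hypothetical computation of the spatial-utilization score.
import random

def shuffle_edges(edges, seed=0):
    """Illustrative perturbation: randomly rewire endpoints, keeping the weights."""
    rng = random.Random(seed)
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    return [(rng.choice(nodes), rng.choice(nodes), w) for _, _, w in edges]

def spatial_utilization(model_answer, instances):
    """model_answer(series, edges, question) -> str; instances hold gold answers."""
    correct, changed = 0, 0
    for series, edges, question, gold in instances:
        original = model_answer(series, edges, question)
        if original.strip() != gold.strip():
            continue                                   # only score correct answers
        correct += 1
        perturbed = model_answer(series, shuffle_edges(edges), question)
        if perturbed.strip() != original.strip():
            changed += 1                               # answer depended on the graph
    return changed / correct if correct else 0.0
```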
Practical Implications
- Smart‑city services – Developers can plug STReasoner into traffic‑management dashboards to answer “why is congestion rising at this intersection?” or to forecast sensor failures before they happen.
- Power‑grid monitoring – Operators can query the model for root‑cause analysis of voltage anomalies, leveraging both SCADA time‑series and the grid topology.
- Epidemiology tools – Public‑health platforms can ask “which region is likely to see a spike next week given current case counts and mobility links?” without building a custom simulation.
- Cost‑effective AI – Because the approach works with modest‑size open‑source LLMs, startups and research labs can deploy spatio‑temporal reasoning without paying for expensive API calls.
- Extensible pipeline – The ST‑Bench data generator is open‑source, allowing teams to create domain‑specific synthetic scenarios (e.g., supply‑chain logistics) and fine‑tune the same architecture.
Limitations & Future Work
- Synthetic‑to‑real gap: Although transfer experiments are promising, performance still drops when moving to highly noisy, non‑stationary real data; additional domain adaptation may be needed.
- Graph size scalability: The current prompt‑based graph encoding becomes unwieldy for networks with >10 k nodes; future work could explore hierarchical graph summarization or retrieval‑augmented methods.
- Interpretability: While S‑GRPO encourages spatial grounding, the model’s internal reasoning steps remain opaque; integrating chain‑of‑thought prompting or explicit reasoning modules could improve transparency.
- Multi‑modal extensions: Incorporating satellite imagery or video streams alongside the time‑series could further enrich reasoning for applications like disaster response.
Overall, STReasoner opens a practical pathway for developers to harness LLMs as “spatio‑temporal analysts,” turning raw sensor streams and network maps into actionable insights with minimal compute overhead.
Authors
- Juntong Ni
- Shiyu Wang
- Ming Jin
- Qi He
- Wei Jin
Paper Information
- arXiv ID: 2601.03248v1
- Categories: cs.CL
- Published: January 6, 2026