[Paper] AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Source: arXiv - 2603.04319v1
Overview
The paper describes the winning system for SemEval‑2026 Task 12: Abductive Event Reasoning. By chaining a graph‑based retrieval module, a large language model (LLM) guided through “reflective prompting”, and a final consistency‑checking step, the authors achieve an accuracy of 0.95, topping the competition leaderboard. Their analysis also uncovers systematic reasoning shortcuts that many modern LLMs share.
Key Contributions
- Three‑stage pipeline that blends symbolic graph retrieval, LLM‑driven abductive inference, and post‑hoc consistency enforcement.
- Reflective Prompt Evolution: a prompt‑design loop where the LLM critiques its own output and iteratively refines the reasoning trace.
- Cross‑model error analysis across 14 models (7 families) that identifies three dominant inductive biases:
  - Causal chain incompleteness – missing intermediate steps.
  - Proximate cause preference – over‑relying on the nearest event.
  - Salience bias – favoring highly salient entities regardless of relevance.
- Empirical evidence that these biases are model‑agnostic; correcting for them reduces the associated error count by 51 %.
Methodology
- Graph‑Based Retrieval
  - A knowledge graph built from commonsense resources (e.g., ConceptNet, ATOMIC) is queried with the two premise events.
  - The graph returns a ranked set of candidate causal links and intermediate nodes that could bridge the premises.
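A minimal, self‑contained sketch of this retrieval step. The toy graph and breadth‑first search below are illustrative stand‑ins for real ConceptNet/ATOMIC queries; all node names are invented, and shorter bridging paths rank higher:

```python
from collections import deque

# Toy commonsense graph (adjacency lists). A real system would query
# ConceptNet/ATOMIC; every node name here is purely illustrative.
GRAPH = {
    "rain": ["wet ground", "umbrella use"],
    "wet ground": ["slippery road"],
    "slippery road": ["car skids"],
    "car skids": ["accident"],
}

def bridging_paths(graph, start, goal, max_depth=4):
    """Breadth-first search for causal paths linking two premise events.

    Returns candidate chains ranked by length (shorter = higher rank).
    """
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) > max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # skip cycles
                queue.append(path + [nxt])
    return sorted(paths, key=len)

candidates = bridging_paths(GRAPH, "rain", "accident")
```

In the full pipeline, these ranked chains become the candidate set handed to the LLM in the next stage.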
- LLM‑Driven Abductive Reasoning
  - The retrieved candidates are fed to a powerful LLM (e.g., GPT‑4‑Turbo).
  - The prompt asks the model to select the most plausible hypothesis and explain the causal chain.
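The prompt‑assembly step might look like the following sketch. The template wording and function name are assumptions, not the paper's exact prompt:

```python
def build_abduction_prompt(premises, candidates):
    """Assemble the abductive-reasoning prompt from retrieved candidate chains.

    `premises` is a pair of premise events; `candidates` is a list of node
    chains from the retrieval stage. The template text is illustrative.
    """
    listed = "\n".join(
        f"{i}. {' -> '.join(chain)}" for i, chain in enumerate(candidates, 1)
    )
    return (
        f"Premise events: {premises[0]}; {premises[1]}\n"
        f"Candidate causal chains:\n{listed}\n"
        "Select the most plausible hypothesis and justify every causal step."
    )
```

The returned string would then be sent to the LLM via whatever chat API the deployment uses.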
- Reflective Prompt Evolution
  - After the first answer, a secondary prompt asks the LLM to critique its own reasoning (e.g., “Is any step missing? Could there be an alternative cause?”).
  - The model revises its hypothesis and justification, iterating once or twice until a self‑consistency threshold is met.
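The reflection loop can be sketched as follows, with `critique_fn` and `revise_fn` standing in for the two LLM calls; the function name and the None‑means‑consistent convention are illustrative choices, not the paper's API:

```python
def reflective_refine(answer, critique_fn, revise_fn, max_rounds=2):
    """Run the self-critique loop, stopping early once the model is satisfied.

    critique_fn(answer) returns None when the answer passes its own review
    (the self-consistency condition), otherwise a critique string;
    revise_fn(answer, critique) folds the critique back into the answer.
    Both stand in for LLM calls in the real system.
    """
    for _ in range(max_rounds):
        critique = critique_fn(answer)
        if critique is None:
            break
        answer = revise_fn(answer, critique)
    return answer
```

Capping `max_rounds` at two mirrors the paper's one-or-two-iteration setup and bounds latency.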
- Post‑hoc Consistency Enforcement
  - A lightweight rule‑based verifier checks that the final hypothesis forms a valid causal chain (no cycles, all nodes connected).
  - If inconsistencies are detected, the system falls back to the second‑best candidate from the retrieval stage.
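A minimal version of such a verifier, assuming hypotheses arrive as node lists over an adjacency‑list graph; the paper's exact rules may differ:

```python
def is_valid_chain(graph, chain):
    """Rule-based verifier for a hypothesized causal chain.

    Rejects chains with repeated nodes (cycles) or with any hop that is not
    a known edge in the graph (disconnected nodes). Requires at least one edge.
    """
    if len(chain) < 2 or len(set(chain)) != len(chain):
        return False
    return all(b in graph.get(a, ()) for a, b in zip(chain, chain[1:]))
```

On failure, the surrounding system would simply retry with the next‑ranked retrieval candidate.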
- Cross‑Model Bias Analysis
  - The same pipeline is run on 14 off‑the‑shelf LLMs.
  - Errors are categorized to surface shared inductive biases, which are then mitigated by adding targeted constraints to the reflective prompts.
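The error‑categorization step could be tallied as in this sketch; the bias labels are shorthand invented here for the three biases named above:

```python
from collections import Counter

# Shorthand labels for the paper's three inductive biases (names assumed).
BIASES = (
    "causal_chain_incompleteness",
    "proximate_cause_preference",
    "salience_bias",
)

def tally_biases(labeled_errors):
    """Aggregate (model_name, bias_label) error records across all models.

    Returns a count per bias, so shared (model-agnostic) biases stand out
    as nonzero across many model families.
    """
    counts = Counter(label for _, label in labeled_errors)
    return {bias: counts.get(bias, 0) for bias in BIASES}
```

In practice each labeled error would come from manual or rubric-based annotation of a model's wrong answers.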
Results & Findings
| Metric | Score |
|---|---|
| Overall Accuracy (test set) | 0.95 (1st place) |
| Ablation (no graph retrieval) | 0.88 |
| Ablation (no reflective prompting) | 0.90 |
| Consistency violations after post‑hoc check | < 1 % |
- Graph retrieval supplies crucial grounding; without it, the LLM drifts to generic, less accurate hypotheses.
- Reflective prompting improves accuracy by ~5 % by catching missing links and correcting proximate‑cause shortcuts.
- The bias analysis shows that all 7 model families exhibit the three identified inductive biases, confirming they are systematic rather than architecture‑specific.
Practical Implications
- Better commonsense reasoning APIs – The three‑stage design can be wrapped as a microservice, giving developers a plug‑and‑play “abductive reasoning” endpoint that outperforms raw LLM calls.
- Debuggable AI pipelines – The reflective step provides a natural audit trail (the model’s self‑critique) useful for compliance and explainability requirements.
- Bias mitigation toolkit – The identified biases can be encoded as prompt‑level constraints for any downstream task that involves causal inference (e.g., incident analysis, automated debugging, narrative generation).
- Hybrid symbolic‑neural systems – Demonstrates a practical recipe for marrying knowledge graphs with LLMs, a pattern that can be reused for tasks like question answering, recommendation, or policy compliance checking.
Limitations & Future Work
- Domain coverage: The graph is built from general‑purpose commonsense resources; specialized domains (medical, legal) would need custom graph extensions.
- Scalability of retrieval: Real‑time graph queries can become a bottleneck for high‑throughput services; indexing optimizations are required.
- Reflective loop depth: The current implementation caps at two reflection iterations; deeper self‑analysis may yield diminishing returns but remains unexplored.
- Bias generalization: While three biases were identified, other subtle inductive biases (e.g., temporal ordering errors) may surface in different datasets and need further study.
Bottom line: By tightly integrating symbolic retrieval, self‑reflective prompting, and consistency checks, the authors deliver a robust, high‑accuracy solution for abductive event reasoning—an approach that developers can adapt to build more trustworthy, commonsense‑aware AI services.
Authors
- Nikolas Karafyllis
- Maria Lymperaiou
- Giorgos Filandrianos
- Athanasios Voulodimos
- Giorgos Stamou
Paper Information
- arXiv ID: 2603.04319v1
- Categories: cs.CL
- Published: March 4, 2026