[Paper] AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Source: arXiv - 2603.04319v1
Overview
The paper describes the winning system for SemEval‑2026 Task 12: Abductive Event Reasoning. By chaining a graph‑based retrieval module, a large language model (LLM) guided through “reflective prompting”, and a final consistency‑checking step, the authors achieve an accuracy of 0.95, topping the competition leaderboard. Their analysis also uncovers systematic reasoning shortcuts that many modern LLMs share.
Key Contributions
- Three‑stage pipeline that blends symbolic graph retrieval, LLM‑driven abductive inference, and post‑hoc consistency enforcement.
- Reflective Prompt Evolution: a prompt‑design loop where the LLM critiques its own output and iteratively refines the reasoning trace.
- Cross‑model error analysis across 14 models (7 families) that identifies three dominant inductive biases:
  - Causal chain incompleteness – missing intermediate steps.
  - Proximate cause preference – over‑relying on the nearest event.
  - Salience bias – favoring highly salient entities regardless of relevance.
- Empirical evidence that these biases are model‑agnostic; correcting for them reduces the associated error count by 51 %.
Methodology
- Graph‑Based Retrieval
  - A knowledge graph built from commonsense resources (e.g., ConceptNet, ATOMIC) is queried with the two premise events.
  - The graph returns a ranked set of candidate causal links and intermediate nodes that could bridge the premises.
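A minimal, self‑contained sketch of this retrieval step. The toy graph and breadth‑first search below are illustrative stand‑ins for real ConceptNet/ATOMIC queries; all node names are invented, and shorter bridging paths rank higher:

```python
from collections import deque

# Toy commonsense graph (adjacency lists). A real system would query
# ConceptNet/ATOMIC; every node name here is purely illustrative.
GRAPH = {
    "rain": ["wet ground", "umbrella use"],
    "wet ground": ["slippery road"],
    "slippery road": ["car skids"],
    "car skids": ["accident"],
}

def bridging_paths(graph, start, goal, max_depth=4):
    """Breadth-first search for causal paths linking two premise events.

    Returns candidate chains ranked by length (shorter = higher rank).
    """
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) > max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # skip cycles
                queue.append(path + [nxt])
    return sorted(paths, key=len)

candidates = bridging_paths(GRAPH, "rain", "accident")
```

In the full pipeline, these ranked chains become the candidate set handed to the LLM in the next stage.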
- LLM‑Driven Abductive Reasoning
  - The retrieved candidates are fed to a powerful LLM (e.g., GPT‑4‑Turbo).
  - The prompt asks the model to select the most plausible hypothesis and explain the causal chain.
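The prompt‑assembly step might look like the following sketch. The template wording and function name are assumptions, not the paper's exact prompt:

```python
def build_abduction_prompt(premises, candidates):
    """Assemble the abductive-reasoning prompt from retrieved candidate chains.

    `premises` is a pair of premise events; `candidates` is a list of node
    chains from the retrieval stage. The template text is illustrative.
    """
    listed = "\n".join(
        f"{i}. {' -> '.join(chain)}" for i, chain in enumerate(candidates, 1)
    )
    return (
        f"Premise events: {premises[0]}; {premises[1]}\n"
        f"Candidate causal chains:\n{listed}\n"
        "Select the most plausible hypothesis and justify every causal step."
    )
```

The returned string would then be sent to the LLM via whatever chat API the deployment uses.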
- Reflective Prompt Evolution
  - After the first answer, a secondary prompt asks the LLM to critique its own reasoning (e.g., “Is any step missing? Could there be an alternative cause?”).
  - The model revises its hypothesis and justification, iterating once or twice until a self‑consistency threshold is met.
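The reflection loop can be sketched as follows, with `critique_fn` and `revise_fn` standing in for the two LLM calls; the function name and the None‑means‑consistent convention are illustrative choices, not the paper's API:

```python
def reflective_refine(answer, critique_fn, revise_fn, max_rounds=2):
    """Run the self-critique loop, stopping early once the model is satisfied.

    critique_fn(answer) returns None when the answer passes its own review
    (the self-consistency condition), otherwise a critique string;
    revise_fn(answer, critique) folds the critique back into the answer.
    Both stand in for LLM calls in the real system.
    """
    for _ in range(max_rounds):
        critique = critique_fn(answer)
        if critique is None:
            break
        answer = revise_fn(answer, critique)
    return answer
```

Capping `max_rounds` at two mirrors the paper's one-or-two-iteration setup and bounds latency.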
- Post‑hoc Consistency Enforcement
  - A lightweight rule‑based verifier checks that the final hypothesis forms a valid causal chain (no cycles, all nodes connected).
  - If inconsistencies are detected, the system falls back to the second‑best candidate from the retrieval stage.
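A minimal version of such a verifier, assuming hypotheses arrive as node lists over an adjacency‑list graph; the paper's exact rules may differ:

```python
def is_valid_chain(graph, chain):
    """Rule-based verifier for a hypothesized causal chain.

    Rejects chains with repeated nodes (cycles) or with any hop that is not
    a known edge in the graph (disconnected nodes). Requires at least one edge.
    """
    if len(chain) < 2 or len(set(chain)) != len(chain):
        return False
    return all(b in graph.get(a, ()) for a, b in zip(chain, chain[1:]))
```

On failure, the surrounding system would simply retry with the next‑ranked retrieval candidate.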
- Cross‑Model Bias Analysis
  - The same pipeline is run on 14 off‑the‑shelf LLMs.
  - Errors are categorized to surface shared inductive biases, which are then mitigated by adding targeted constraints to the reflective prompts.
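The error‑categorization step could be tallied as in this sketch; the bias labels are shorthand invented here for the three biases named above:

```python
from collections import Counter

# Shorthand labels for the paper's three inductive biases (names assumed).
BIASES = (
    "causal_chain_incompleteness",
    "proximate_cause_preference",
    "salience_bias",
)

def tally_biases(labeled_errors):
    """Aggregate (model_name, bias_label) error records across all models.

    Returns a count per bias, so shared (model-agnostic) biases stand out
    as nonzero across many model families.
    """
    counts = Counter(label for _, label in labeled_errors)
    return {bias: counts.get(bias, 0) for bias in BIASES}
```

In practice each labeled error would come from manual or rubric-based annotation of a model's wrong answers.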
Results & Findings
| Metric | Score |
|---|---|
| Overall Accuracy (test set) | 0.95 (1st place) |
| Ablation (no graph retrieval) | 0.88 |
| Ablation (no reflective prompting) | 0.90 |
| Consistency violations after post‑hoc check | < 1 % |
- Graph retrieval supplies crucial grounding; without it, the LLM drifts to generic, less accurate hypotheses.
- Reflective prompting improves accuracy by ~5 % by catching missing links and correcting proximate‑cause shortcuts.
- The bias analysis shows that all 7 model families exhibit the three identified inductive biases, confirming they are systematic rather than architecture‑specific.
Practical Implications
- Better commonsense reasoning APIs – The three‑stage design can be wrapped as a microservice, giving developers a plug‑and‑play “abductive reasoning” endpoint that outperforms raw LLM calls.
- Debuggable AI pipelines – The reflective step provides a natural audit trail (the model’s self‑critique) useful for compliance and explainability requirements.
- Bias mitigation toolkit – The identified biases can be encoded as prompt‑level constraints for any downstream task that involves causal inference (e.g., incident analysis, automated debugging, narrative generation).
- Hybrid symbolic‑neural systems – Demonstrates a practical recipe for marrying knowledge graphs with LLMs, a pattern that can be reused for tasks like question answering, recommendation, or policy compliance checking.
Limitations & Future Work
- Domain coverage: The graph is built from general‑purpose commonsense resources; specialized domains (medical, legal) would need custom graph extensions.
- Scalability of retrieval: Real‑time graph queries can become a bottleneck for high‑throughput services; indexing optimizations are required.
- Reflective loop depth: The current implementation caps at two reflection iterations; deeper self‑analysis may yield diminishing returns but remains unexplored.
- Bias generalization: While three biases were identified, other subtle inductive biases (e.g., temporal ordering errors) may surface in different datasets and need further study.
Bottom line: By tightly integrating symbolic retrieval, self‑reflective prompting, and consistency checks, the authors deliver a robust, high‑accuracy solution for abductive event reasoning—an approach that developers can adapt to build more trustworthy, commonsense‑aware AI services.
Authors
- Nikolas Karafyllis
- Maria Lymperaiou
- Giorgos Filandrianos
- Athanasios Voulodimos
- Giorgos Stamou
Paper Information
- arXiv ID: 2603.04319v1
- Categories: cs.CL
- Published: March 4, 2026