[Paper] AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

Published: March 4, 2026 at 12:38 PM EST
Source: arXiv (2603.04319v1)

Overview

The paper describes the winning system for SemEval‑2026 Task 12: Abductive Event Reasoning. By chaining a graph‑based retrieval module with a large language model (LLM) guided through “reflective prompting” and a final consistency‑checking step, the authors achieve 0.95 accuracy, topping the competition leaderboard. Their analysis also uncovers systematic reasoning shortcuts shared by many modern LLMs.

Key Contributions

  • Three‑stage pipeline that blends symbolic graph retrieval, LLM‑driven abductive inference, and post‑hoc consistency enforcement.
  • Reflective Prompt Evolution: a prompt‑design loop where the LLM critiques its own output and iteratively refines the reasoning trace.
  • Cross‑model error analysis across 14 models (7 families) that identifies three dominant inductive biases:
    1. Causal chain incompleteness – missing intermediate steps.
    2. Proximate cause preference – over‑relying on the nearest event.
    3. Salience bias – favoring highly salient entities regardless of relevance.
  • Empirical evidence that these biases are model‑agnostic; correcting for them reduces the corresponding errors by 51 %.
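The Reflective Prompt Evolution loop named above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: `answer_fn` and `critique_fn` are hypothetical stand-ins for the two LLM calls, and the literal "consistent" verdict stands in for the paper's self‑consistency threshold.

```python
def reflective_refine(answer_fn, critique_fn, question, max_rounds=2):
    """Reflective prompt evolution, sketched: draft an answer, ask the
    model to critique its own reasoning, and regenerate with the
    critique appended until the critique reports consistency (or the
    round cap is hit)."""
    answer = answer_fn(question)
    for _ in range(max_rounds):
        critique = critique_fn(question, answer)
        if critique == "consistent":   # self-consistency threshold met
            break
        # Fold the self-critique back into the prompt and retry.
        answer = answer_fn(f"{question}\nCritique of previous attempt: {critique}")
    return answer
```

The `max_rounds=2` default mirrors the paper's cap of one or two reflection iterations.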

Methodology

  1. Graph‑Based Retrieval

    • A knowledge graph built from commonsense resources (e.g., ConceptNet, ATOMIC) is queried with the two premise events.
    • The graph returns a ranked set of candidate causal links and intermediate nodes that could bridge the premises.
  2. LLM‑Driven Abductive Reasoning

    • The retrieved candidates are fed to a powerful LLM (e.g., GPT‑4‑Turbo).
    • The prompt asks the model to select the most plausible hypothesis and explain the causal chain.
  3. Reflective Prompt Evolution

    • After the first answer, a secondary prompt asks the LLM to critique its own reasoning (e.g., “Is any step missing? Could there be an alternative cause?”).
    • The model revises its hypothesis and justification, iterating once or twice until a self‑consistency threshold is met.
  4. Post‑hoc Consistency Enforcement

    • A lightweight rule‑based verifier checks that the final hypothesis forms a valid causal chain (no cycles, all nodes connected).
    • If inconsistencies are detected, the system falls back to the second‑best candidate from the retrieval stage.
  5. Cross‑Model Bias Analysis

    • The same pipeline is run on 14 off‑the‑shelf LLMs.
    • Errors are categorized to surface shared inductive biases, which are then mitigated by adding targeted constraints to the reflective prompts.
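Two of the symbolic components above (steps 1 and 4) can be sketched directly: a breadth‑first search for bridging chains between the premises, and a rule‑based verifier that rejects cyclic or disconnected hypotheses. The toy graph, function names, and hop cap are illustrative assumptions; the paper's graph is built from ConceptNet and ATOMIC with its own ranking.

```python
from collections import deque

def bridging_paths(graph, premise_a, premise_b, max_hops=3):
    """Step 1, sketched: BFS for causal chains linking the two
    premises, with shorter chains ranked first."""
    paths, queue = [], deque([[premise_a]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 > max_hops:
            continue
        if path[-1] == premise_b and len(path) > 1:
            paths.append(path)
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:          # no revisits, hence no cycles
                queue.append(path + [nxt])
    return sorted(paths, key=len)

def is_valid_chain(edges):
    """Step 4, sketched: a hypothesis (directed cause->effect edges)
    must be acyclic and connected, else the verifier rejects it."""
    nodes = {n for edge in edges for n in edge}
    if not nodes:
        return False
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    # Cycle check via DFS with node coloring (1 = in progress, 2 = done).
    state = {}
    def cyclic(node):
        if state.get(node) == 1:
            return True
        if state.get(node) == 2:
            return False
        state[node] = 1
        if any(cyclic(nxt) for nxt in succ.get(node, [])):
            return True
        state[node] = 2
        return False
    if any(cyclic(n) for n in nodes):
        return False
    # Connectivity check on the undirected version of the graph.
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, todo = set(), [next(iter(nodes))]
    while todo:
        n = todo.pop()
        if n not in seen:
            seen.add(n)
            todo.extend(adj[n] - seen)
    return seen == nodes

# Hypothetical mini-graph of "leads to" edges.
toy_graph = {
    "heavy rain": ["wet road"],
    "wet road": ["car skids"],
    "car skids": ["accident"],
}
best = bridging_paths(toy_graph, "heavy rain", "accident")[0]
print(best)                                  # the one bridging chain
print(is_valid_chain(list(zip(best, best[1:]))))   # True
```

In the real pipeline the verifier's rejection triggers the fallback to the second‑best retrieval candidate, which here would simply mean trying `bridging_paths(...)[1]`.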

Results & Findings

Metric                                        Score
Overall Accuracy (test set)                   0.95 (1st place)
Ablation (no graph retrieval)                 0.88
Ablation (no reflective prompting)            0.90
Consistency violations after post‑hoc check   < 1 %

  • Graph retrieval supplies crucial grounding; without it, the LLM drifts to generic, less accurate hypotheses.
  • Reflective prompting improves accuracy by ~5 percentage points by catching missing links and correcting proximate‑cause shortcuts.
  • The bias analysis shows that all 7 model families exhibit the three identified inductive biases, confirming they are systematic rather than architecture‑specific.
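As the bias analysis suggests, the three shortcuts can be countered with targeted prompt‑level constraints. A sketch of that mitigation step follows; the constraint wordings are illustrative assumptions, not the authors' exact text.

```python
# One targeted instruction per bias surfaced by the error analysis
# (wordings are illustrative, not the paper's).
BIAS_CONSTRAINTS = {
    "causal_chain_incompleteness":
        "List every intermediate event; do not skip steps.",
    "proximate_cause_preference":
        "Consider distal causes, not only the nearest preceding event.",
    "salience_bias":
        "Ignore how prominent an entity is; judge only its causal relevance.",
}

def constrain_prompt(prompt, biases):
    """Append one mitigation constraint per observed bias."""
    extra = [BIAS_CONSTRAINTS[b] for b in biases]
    return prompt + "\nConstraints:\n" + "\n".join(f"- {c}" for c in extra)
```

Because the constraints are plain prompt text, the same dictionary can be reused with any of the 14 evaluated models.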

Practical Implications

  • Better commonsense reasoning APIs – The three‑stage design can be wrapped as a microservice, giving developers a plug‑and‑play “abductive reasoning” endpoint that outperforms raw LLM calls.
  • Debuggable AI pipelines – The reflective step provides a natural audit trail (the model’s self‑critique) useful for compliance and explainability requirements.
  • Bias mitigation toolkit – The identified biases can be encoded as prompt‑level constraints for any downstream task that involves causal inference (e.g., incident analysis, automated debugging, narrative generation).
  • Hybrid symbolic‑neural systems – Demonstrates a practical recipe for marrying knowledge graphs with LLMs, a pattern that can be reused for tasks like question answering, recommendation, or policy compliance checking.

Limitations & Future Work

  • Domain coverage: The graph is built from general‑purpose commonsense resources; specialized domains (medical, legal) would need custom graph extensions.
  • Scalability of retrieval: Real‑time graph queries can become a bottleneck for high‑throughput services; indexing optimizations are required.
  • Reflective loop depth: The current implementation caps at two reflection iterations; deeper self‑analysis may yield diminishing returns but remains unexplored.
  • Bias generalization: While three biases were identified, other subtle inductive biases (e.g., temporal ordering errors) may surface in different datasets and need further study.

Bottom line: By tightly integrating symbolic retrieval, self‑reflective prompting, and consistency checks, the authors deliver a robust, high‑accuracy solution for abductive event reasoning—an approach that developers can adapt to build more trustworthy, commonsense‑aware AI services.

Authors

  • Nikolas Karafyllis
  • Maria Lymperaiou
  • Giorgos Filandrianos
  • Athanasios Voulodimos
  • Giorgos Stamou

Paper Information

  • arXiv ID: 2603.04319v1
  • Categories: cs.CL
  • Published: March 4, 2026