[Paper] Event Extraction in Large Language Model

Published: December 22, 2025 at 11:22 AM EST
4 min read

Source: arXiv - 2512.19537v1

Overview

The paper surveys how large language models (LLMs) are reshaping event extraction (EE)—the task of detecting events, their participants, timestamps, and causal links—from classic rule‑based pipelines to modern prompting‑driven, generative approaches. By treating EE as a cognitive scaffold for LLM‑centric systems, the authors outline how structured event schemas, intermediate representations, and persistent event stores can mitigate common LLM pitfalls such as hallucinations and limited context windows.

Key Contributions

  • Unified taxonomy of EE tasks across text and multimodal data, covering trigger detection, argument filling, temporal/causal linking, cross‑document reasoning, and cross‑lingual scenarios.
  • Historical roadmap tracing EE methods from handcrafted rules → neural sequence models → instruction‑tuned and generative LLM frameworks.
  • System‑level perspective: proposes four EE “interfaces” (schemas & constraints, event‑centric intermediate structures, graph‑based retrieval‑augmented generation, and persistent event stores) that turn raw LLM outputs into reliable, verifiable knowledge.
  • Comprehensive benchmark summary: catalogs datasets, evaluation metrics, and decoding strategies (e.g., constrained beam search, self‑consistency prompting) used in recent LLM‑based EE research.
  • Critical analysis of failure modes: identifies hallucination, fragile temporal/causal linking, and context‑window limits as the main bottlenecks for deploying LLM EE pipelines.
  • Future‑oriented research agenda: outlines directions such as graph‑aware prompting, episodic memory integration, multimodal grounding, and low‑resource adaptation.

Methodology

The authors conduct a survey‑style literature review augmented with a conceptual framework for building EE‑centric systems around LLMs:

  1. Task Formalization – Define EE as a series of structured prediction steps (trigger, arguments, links) that can be expressed as a sequence‑to‑sequence or generation problem.
  2. Prompt Engineering Taxonomy – Classify zero‑shot, few‑shot, and instruction‑tuned prompting strategies, and discuss decoding tricks (e.g., constrained decoding, self‑verification loops).
  3. Intermediate Representation Design – Propose event schemas (JSON‑like slots) and event graphs that act as “controlled” outputs, enabling downstream verification and reasoning (see the extraction sketch after this list).
  4. Retrieval‑Augmented Generation (RAG) – Show how event‑centric graphs can guide document retrieval, feeding relevant context back into the LLM for long‑range reasoning.
  5. Memory Layer – Introduce an event store that persists extracted events beyond the LLM’s context window, supporting episodic memory and continual learning.
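
To make the schema-and-verification idea concrete, below is a minimal sketch of schema-guided extraction with a post-hoc check. The EVENT_SCHEMA slots, prompt wording, and the `call_llm` helper are illustrative assumptions for this post, not the survey's implementation.

```python
import json

# Hypothetical event schema acting as the "controlled" output interface
# described in the survey (slot names are illustrative, not from the paper).
EVENT_SCHEMA = {
    "event_type": "str, one of the ontology types (e.g. 'Attack', 'Transport')",
    "trigger": "str, the word or phrase that evokes the event",
    "arguments": "list of objects with 'role' and 'mention' strings",
    "time": "str or null, a normalized date/time expression if present",
}

PROMPT_TEMPLATE = """Extract every event from the text below.
Return ONLY a JSON list in which each item follows this schema:
{schema}

Text:
{text}
"""

def extract_events(text: str, call_llm) -> list[dict]:
    """Schema-guided extraction with a simple verification pass.

    `call_llm` is an assumed callable (prompt -> completion string);
    plug in whatever LLM client you use.
    """
    prompt = PROMPT_TEMPLATE.format(
        schema=json.dumps(EVENT_SCHEMA, indent=2), text=text
    )
    raw = call_llm(prompt)
    try:
        events = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output; a real system would retry or repair

    verified = []
    for ev in events:
        if not isinstance(ev, dict) or not set(ev).issubset(EVENT_SCHEMA):
            continue  # reject slots the schema does not define
        if ev.get("trigger") and ev["trigger"] not in text:
            continue  # reject hallucinated triggers absent from the source
        ev["arguments"] = [
            a for a in ev.get("arguments", [])
            if isinstance(a, dict) and a.get("mention", "") in text
        ]
        verified.append(ev)
    return verified
```

The verification step here is deliberately cheap (substring checks against the source text); the survey's point is that any such schema-level filter already removes a large share of spurious slots before downstream use.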

The survey synthesizes findings from over 150 papers, comparing architectures (encoder‑decoder, decoder‑only, multimodal vision‑language models), datasets (ACE, MAVEN, EventStoryLine, multimodal video caption corpora), and evaluation protocols (F1, Slot‑F1, Temporal Accuracy).
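
For orientation, trigger and argument F1 are typically computed by exact match against gold annotations. The sketch below is an approximation only, since benchmarks differ in matching rules (e.g., head-word vs. full-span matching), and Slot‑F1 additionally requires the role label to match.

```python
def f1(pred: set, gold: set) -> float:
    """Exact-match F1 over (doc_id, span, label) tuples.

    Trigger F1 scores (doc_id, trigger_span, event_type) tuples; argument /
    Slot-F1 additionally includes the role label in the tuple.
    """
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: one correct trigger, one spurious, one missed -> P=0.5, R=0.5, F1=0.5
pred = {("doc1", "detonated", "Attack"), ("doc1", "visited", "Meet")}
gold = {("doc1", "detonated", "Attack"), ("doc1", "fled", "Transport")}
print(round(f1(pred, gold), 2))  # 0.5
```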

Results & Findings

  • Accuracy: Prompt‑based LLMs (GPT‑4, Claude) achieve near‑state‑of‑the‑art trigger and argument F1 scores on ACE/MAVEN when equipped with schema‑constrained decoding, narrowing the gap with specialized neural EE models.
  • Hallucination: Without explicit constraints, LLMs generate spurious arguments (~15‑20% false positives); schema‑guided prompts reduce this to <5%.
  • Temporal/Causal Linking: Long‑range linking remains weak; performance drops ~30% when events are separated by more than 3 sentences or span multiple documents. Graph‑aware RAG improves linking by ~12 points absolute.
  • Multimodal EE: Vision‑language LLMs (e.g., Flamingo, BLIP‑2) can extract events from video subtitles plus frames, but still lag pure‑text models by ~10% in argument recall.
  • Cross‑lingual Transfer: Instruction‑tuned multilingual LLMs (e.g., LLaMA‑2‑Chat) show promising zero‑shot performance on Chinese and Arabic EE benchmarks, though slot‑level precision remains ~5‑7 points lower than in English.

Overall, the survey finds that structured prompting + verification loops are the most effective current recipe for reliable EE with LLMs.
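
One common form of such a verification loop is sampling-based self-consistency: run the same extraction several times and keep only the events most samples agree on. The sketch below is an illustration of that idea under assumptions made for this post (the `extract_once` helper and the trigger-level voting granularity are not prescribed by the paper).

```python
from collections import Counter
import json

def self_consistent_extract(text: str, extract_once,
                            n_samples: int = 5, min_votes: int = 3) -> list[dict]:
    """Sampling-based self-consistency for EE.

    `extract_once` is an assumed callable (text -> JSON string of events),
    e.g. one schema-prompted LLM call with non-zero temperature. Only the
    (trigger, event_type) pairs that a majority of samples agree on survive.
    """
    votes = Counter()
    payloads = {}
    for _ in range(n_samples):
        try:
            events = json.loads(extract_once(text))
        except json.JSONDecodeError:
            continue  # skip malformed samples
        seen = set()
        for ev in events:
            if not isinstance(ev, dict):
                continue
            key = (ev.get("trigger"), ev.get("event_type"))
            if key in seen:
                continue  # count each distinct event once per sample
            seen.add(key)
            votes[key] += 1
            payloads[key] = ev  # keep one representative payload per event
    return [payloads[k] for k, v in votes.items() if v >= min_votes]
```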

Practical Implications

  1. Rapid Prototyping – Developers can spin up an EE service by simply wrapping an LLM with a JSON schema prompt, avoiding costly model‑training pipelines.
  2. Enterprise Knowledge Graphs – Event‑centric graphs extracted via LLMs can feed directly into downstream KG construction, enabling real‑time incident monitoring (e.g., cybersecurity alerts, supply‑chain disruptions).
  3. Customer Support Automation – By grounding LLM responses in verified event slots, chatbots can provide traceable explanations (e.g., “Your order was shipped on 2024‑11‑30”).
  4. Multimodal Surveillance – Combining video captioning with LLM‑driven EE allows automated detection of safety incidents (e.g., “person fell from ladder”) with minimal annotation effort.
  5. Long‑Term Memory for Agents – Persistent event stores let autonomous agents recall past actions beyond the LLM’s context window, supporting coherent planning over weeks or months (a minimal store sketch follows this list).
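
As an illustration of the last point, a persistent event store can be as simple as a small SQLite table keyed by event type and time. The table layout and query interface below are assumptions made for this post, not a design prescribed by the survey.

```python
import json
import sqlite3

class EventStore:
    """Persist extracted events beyond the LLM's context window."""

    def __init__(self, path: str = "events.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY, event_type TEXT, trigger TEXT, "
            "occurred_at TEXT, payload TEXT)"
        )

    def add(self, event: dict) -> None:
        """Store one extracted event (the full slot structure goes in payload)."""
        self.conn.execute(
            "INSERT INTO events (event_type, trigger, occurred_at, payload) "
            "VALUES (?, ?, ?, ?)",
            (event.get("event_type"), event.get("trigger"),
             event.get("time"), json.dumps(event)),
        )
        self.conn.commit()

    def recall(self, event_type: str, limit: int = 20) -> list[dict]:
        """Fetch recent events of a type, e.g. to ground an agent's next plan."""
        rows = self.conn.execute(
            "SELECT payload FROM events WHERE event_type = ? "
            "ORDER BY occurred_at DESC LIMIT ?",
            (event_type, limit),
        ).fetchall()
        return [json.loads(r[0]) for r in rows]
```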

Limitations & Future Work

  • Scalability of Verification – Schema‑constrained decoding reduces hallucinations but adds computational overhead; efficient verification mechanisms are still needed.
  • Long‑Document Reasoning – Current RAG approaches struggle with linking events across many documents; hierarchical graph retrieval is a promising direction.
  • Multimodal Fusion – Aligning visual cues with textual event slots remains an open challenge, especially for fine‑grained temporal relations.
  • Low‑Resource Languages – While multilingual LLMs show promise, performance gaps persist for languages with limited training data; cross‑lingual transfer learning and data augmentation are required.
  • Evaluation Standards – The field lacks a unified benchmark that jointly assesses trigger detection, argument filling, temporal/causal linking, and cross‑modal consistency.

The authors call for agent‑ready EE frameworks that combine LLM generation, graph‑based reasoning, and persistent memory to deliver reliable, open‑world event understanding.

Authors

  • Bobo Li
  • Xudong Han
  • Jiang Liu
  • Yuzhe Ding
  • Liqiang Jing
  • Zhaoqi Zhang
  • Jinheng Li
  • Xinya Du
  • Fei Li
  • Meishan Zhang
  • Min Zhang
  • Aixin Sun
  • Philip S. Yu
  • Hao Fei

Paper Information

  • arXiv ID: 2512.19537v1
  • Categories: cs.CL
  • Published: December 22, 2025