[Paper] Event Extraction in Large Language Model

Published: December 22, 2025 at 11:22 AM EST
4 min read

Source: arXiv - 2512.19537v1

Overview

The paper surveys how large language models (LLMs) are reshaping event extraction (EE)—the task of detecting events, their participants, timestamps, and causal links—from classic rule‑based pipelines to modern prompting‑driven, generative approaches. By treating EE as a cognitive scaffold for LLM‑centric systems, the authors outline how structured event schemas, intermediate representations, and persistent event stores can mitigate common LLM pitfalls such as hallucinations and limited context windows.

Key Contributions

  • Unified taxonomy of EE tasks across text and multimodal data, covering trigger detection, argument filling, temporal/causal linking, cross‑document reasoning, and cross‑lingual scenarios.
  • Historical roadmap tracing EE methods from handcrafted rules → neural sequence models → instruction‑tuned and generative LLM frameworks.
  • System‑level perspective: proposes four EE “interfaces” (schemas & constraints, event‑centric intermediate structures, graph‑based retrieval‑augmented generation, and persistent event stores) that turn raw LLM outputs into reliable, verifiable knowledge.
  • Comprehensive benchmark summary: catalogs datasets, evaluation metrics, and decoding strategies (e.g., constrained beam search, self‑consistency prompting) used in recent LLM‑based EE research.
  • Critical analysis of failure modes: identifies hallucination, fragile temporal/causal linking, and context‑window limits as the main bottlenecks for deploying LLM EE pipelines.
  • Future‑oriented research agenda: outlines directions such as graph‑aware prompting, episodic memory integration, multimodal grounding, and low‑resource adaptation.

Methodology

The authors conduct a survey‑style literature review augmented with a conceptual framework for building EE‑centric systems around LLMs:

  1. Task Formalization – Define EE as a series of structured prediction steps (trigger, arguments, links) that can be expressed as a sequence‑to‑sequence or generation problem.
  2. Prompt Engineering Taxonomy – Classify zero‑shot, few‑shot, and instruction‑tuned prompting strategies, and discuss decoding tricks (e.g., constrained decoding, self‑verification loops).
  3. Intermediate Representation Design – Propose event schemas (JSON‑like slots) and event graphs that act as “controlled” outputs, enabling downstream verification and reasoning (see the extraction sketch after this list).
  4. Retrieval‑Augmented Generation (RAG) – Show how event‑centric graphs can guide document retrieval, feeding relevant context back into the LLM for long‑range reasoning.
  5. Memory Layer – Introduce an event store that persists extracted events beyond the LLM’s context window, supporting episodic memory and continual learning.
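
To make the schema-and-verification idea concrete, below is a minimal sketch of schema-guided extraction with a post-hoc check. The EVENT_SCHEMA slots, prompt wording, and the `call_llm` helper are illustrative assumptions for this post, not the survey's implementation.

```python
import json

# Hypothetical event schema acting as the "controlled" output interface
# described in the survey (slot names are illustrative, not from the paper).
EVENT_SCHEMA = {
    "event_type": "str, one of the ontology types (e.g. 'Attack', 'Transport')",
    "trigger": "str, the word or phrase that evokes the event",
    "arguments": "list of objects with 'role' and 'mention' strings",
    "time": "str or null, a normalized date/time expression if present",
}

PROMPT_TEMPLATE = """Extract every event from the text below.
Return ONLY a JSON list in which each item follows this schema:
{schema}

Text:
{text}
"""

def extract_events(text: str, call_llm) -> list[dict]:
    """Schema-guided extraction with a simple verification pass.

    `call_llm` is an assumed callable (prompt -> completion string);
    plug in whatever LLM client you use.
    """
    prompt = PROMPT_TEMPLATE.format(
        schema=json.dumps(EVENT_SCHEMA, indent=2), text=text
    )
    raw = call_llm(prompt)
    try:
        events = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed output; a real system would retry or repair

    verified = []
    for ev in events:
        if not isinstance(ev, dict) or not set(ev).issubset(EVENT_SCHEMA):
            continue  # reject slots the schema does not define
        if ev.get("trigger") and ev["trigger"] not in text:
            continue  # reject hallucinated triggers absent from the source
        ev["arguments"] = [
            a for a in ev.get("arguments", [])
            if isinstance(a, dict) and a.get("mention", "") in text
        ]
        verified.append(ev)
    return verified
```

The verification step here is deliberately cheap (substring checks against the source text); the survey's point is that any such schema-level filter already removes a large share of spurious slots before downstream use.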

The survey synthesizes findings from over 150 papers, comparing architectures (encoder‑decoder, decoder‑only, multimodal vision‑language models), datasets (ACE, MAVEN, EventStoryLine, multimodal video caption corpora), and evaluation protocols (F1, Slot‑F1, Temporal Accuracy).
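
For orientation, trigger and argument F1 are typically computed by exact match against gold annotations. The sketch below is an approximation only, since benchmarks differ in matching rules (e.g., head-word vs. full-span matching), and Slot‑F1 additionally requires the role label to match.

```python
def f1(pred: set, gold: set) -> float:
    """Exact-match F1 over (doc_id, span, label) tuples.

    Trigger F1 scores (doc_id, trigger_span, event_type) tuples; argument /
    Slot-F1 additionally includes the role label in the tuple.
    """
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: one correct trigger, one spurious, one missed -> P=0.5, R=0.5, F1=0.5
pred = {("doc1", "detonated", "Attack"), ("doc1", "visited", "Meet")}
gold = {("doc1", "detonated", "Attack"), ("doc1", "fled", "Transport")}
print(round(f1(pred, gold), 2))  # 0.5
```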

Results & Findings

  • Accuracy: Prompt‑based LLMs (GPT‑4, Claude) achieve near‑state‑of‑the‑art trigger and argument F1 scores on ACE/MAVEN when equipped with schema‑constrained decoding, narrowing the gap with specialized neural EE models.
  • Hallucination: Without explicit constraints, LLMs generate spurious arguments (~15‑20% false positives); schema‑guided prompts reduce this to <5%.
  • Temporal/Causal Linking: Long‑range linking remains weak; performance drops ~30% when events are separated by more than 3 sentences or span multiple documents. Graph‑aware RAG improves linking by ~12 points absolute.
  • Multimodal EE: Vision‑language LLMs (e.g., Flamingo, BLIP‑2) can extract events from video subtitles plus frames, but still lag pure‑text models by ~10% in argument recall.
  • Cross‑lingual Transfer: Instruction‑tuned multilingual LLMs (e.g., LLaMA‑2‑Chat) show promising zero‑shot performance on Chinese and Arabic EE benchmarks, though slot‑level precision remains ~5‑7 points lower than in English.

Overall, the survey finds that structured prompting + verification loops are the most effective current recipe for reliable EE with LLMs.
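
One common form of such a verification loop is sampling-based self-consistency: run the same extraction several times and keep only the events most samples agree on. The sketch below is an illustration of that idea under assumptions made for this post (the `extract_once` helper and the trigger-level voting granularity are not prescribed by the paper).

```python
from collections import Counter
import json

def self_consistent_extract(text: str, extract_once,
                            n_samples: int = 5, min_votes: int = 3) -> list[dict]:
    """Sampling-based self-consistency for EE.

    `extract_once` is an assumed callable (text -> JSON string of events),
    e.g. one schema-prompted LLM call with non-zero temperature. Only the
    (trigger, event_type) pairs that a majority of samples agree on survive.
    """
    votes = Counter()
    payloads = {}
    for _ in range(n_samples):
        try:
            events = json.loads(extract_once(text))
        except json.JSONDecodeError:
            continue  # skip malformed samples
        seen = set()
        for ev in events:
            if not isinstance(ev, dict):
                continue
            key = (ev.get("trigger"), ev.get("event_type"))
            if key in seen:
                continue  # count each distinct event once per sample
            seen.add(key)
            votes[key] += 1
            payloads[key] = ev  # keep one representative payload per event
    return [payloads[k] for k, v in votes.items() if v >= min_votes]
```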

Practical Implications

  1. Rapid Prototyping – Developers can spin up an EE service by simply wrapping an LLM with a JSON schema prompt, avoiding costly model‑training pipelines.
  2. Enterprise Knowledge Graphs – Event‑centric graphs extracted via LLMs can feed directly into downstream KG construction, enabling real‑time incident monitoring (e.g., cybersecurity alerts, supply‑chain disruptions).
  3. Customer Support Automation – By grounding LLM responses in verified event slots, chatbots can provide traceable explanations (e.g., “Your order was shipped on 2024‑11‑30”).
  4. Multimodal Surveillance – Combining video captioning with LLM‑driven EE allows automated detection of safety incidents (e.g., “person fell from ladder”) with minimal annotation effort.
  5. Long‑Term Memory for Agents – Persistent event stores let autonomous agents recall past actions beyond the LLM’s context window, supporting coherent planning over weeks or months (a minimal store sketch follows this list).
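
As an illustration of the last point, a persistent event store can be as simple as a small SQLite table keyed by event type and time. The table layout and query interface below are assumptions made for this post, not a design prescribed by the survey.

```python
import json
import sqlite3

class EventStore:
    """Persist extracted events beyond the LLM's context window."""

    def __init__(self, path: str = "events.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY, event_type TEXT, trigger TEXT, "
            "occurred_at TEXT, payload TEXT)"
        )

    def add(self, event: dict) -> None:
        """Store one extracted event (the full slot structure goes in payload)."""
        self.conn.execute(
            "INSERT INTO events (event_type, trigger, occurred_at, payload) "
            "VALUES (?, ?, ?, ?)",
            (event.get("event_type"), event.get("trigger"),
             event.get("time"), json.dumps(event)),
        )
        self.conn.commit()

    def recall(self, event_type: str, limit: int = 20) -> list[dict]:
        """Fetch recent events of a type, e.g. to ground an agent's next plan."""
        rows = self.conn.execute(
            "SELECT payload FROM events WHERE event_type = ? "
            "ORDER BY occurred_at DESC LIMIT ?",
            (event_type, limit),
        ).fetchall()
        return [json.loads(r[0]) for r in rows]
```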

Limitations & Future Work

  • Scalability of Verification – Schema‑constrained decoding reduces hallucinations but adds computational overhead; efficient verification mechanisms are still needed.
  • Long‑Document Reasoning – Current RAG approaches struggle with linking events across many documents; hierarchical graph retrieval is a promising direction.
  • Multimodal Fusion – Aligning visual cues with textual event slots remains an open challenge, especially for fine‑grained temporal relations.
  • Low‑Resource Languages – While multilingual LLMs show promise, performance gaps persist for languages with limited training data; cross‑lingual transfer learning and data augmentation are required.
  • Evaluation Standards – The field lacks a unified benchmark that jointly assesses trigger detection, argument filling, temporal/causal linking, and cross‑modal consistency.

The authors call for agent‑ready EE frameworks that combine LLM generation, graph‑based reasoning, and persistent memory to deliver reliable, open‑world event understanding.

Authors

  • Bobo Li
  • Xudong Han
  • Jiang Liu
  • Yuzhe Ding
  • Liqiang Jing
  • Zhaoqi Zhang
  • Jinheng Li
  • Xinya Du
  • Fei Li
  • Meishan Zhang
  • Min Zhang
  • Aixin Sun
  • Philip S. Yu
  • Hao Fei

Paper Information

  • arXiv ID: 2512.19537v1
  • Categories: cs.CL
  • Published: December 22, 2025