[Paper] EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Published: April 23, 2026 at 01:42 PM EDT
4 min read
Source: arXiv - 2604.21890v1

Overview

The paper introduces EVENT5Ws, a new, large‑scale, manually annotated dataset for open‑domain event extraction from full‑document texts. By covering a broad spectrum of event types and providing statistically verified annotations, the dataset fills a critical gap that has limited the development of robust, real‑world event‑extraction systems.

Key Contributions

  • EVENT5Ws dataset: More than 200,000 event instances spanning the classic “5 Ws” (who, what, when, where, why) across diverse domains and geographies.
  • Systematic annotation pipeline: A reproducible workflow that combines expert guidelines, crowdsourced validation, and statistical quality checks.
  • Benchmarking suite: Evaluation of several state‑of‑the‑art pre‑trained large language models (LLMs) on EVENT5Ws, establishing baseline performance numbers.
  • Cross‑domain generalization study: Demonstrates that models fine‑tuned on EVENT5Ws transfer well to other event‑extraction corpora (e.g., crisis‑report datasets from different regions).
  • Practical lessons & recommendations: A concise “cookbook” for building large, high‑quality annotation projects in NLP.

Methodology

  1. Data Collection – The authors scraped publicly available news articles, blog posts, and reports covering a wide range of topics (politics, natural disasters, sports, etc.).
  2. Annotation Schema – Each event is broken down into the five canonical components (who, what, when, where, why). Annotators label spans in the original document that answer each component.
  3. Annotation Pipeline
    • Guideline design: Detailed examples and edge‑case handling rules.
    • Crowdsourced labeling: Multiple annotators per document; majority voting resolves disagreements.
    • Expert review: A subset is double‑checked by domain experts to compute inter‑annotator agreement (Cohen’s κ ≈ 0.78).
    • Statistical verification: Bootstrapped sampling ensures the final set meets a predefined confidence threshold for label accuracy.
  4. Model Evaluation – Fine‑tune several LLMs (BERT, RoBERTa, T5, GPT‑3.5) on the training split, then test on held‑out EVENT5Ws data and on external event‑extraction benchmarks.

The pipeline is deliberately modular, allowing teams to swap in different annotators, models, or quality‑control steps without redesigning the whole process.
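The statistical-verification step can be approximated with a simple bootstrap over an expert-audited sample: resample the audited labels with replacement and accept the batch only if the lower confidence bound clears a preset accuracy threshold. The paper does not publish its exact procedure, so the sample size, threshold, and interval construction here are illustrative assumptions.

```python
import random

def bootstrap_accuracy_ci(labels_correct: list[bool], n_boot: int = 2000,
                          alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Bootstrapped (1 - alpha) confidence interval for label accuracy."""
    rng = random.Random(seed)
    n = len(labels_correct)
    accs = []
    for _ in range(n_boot):
        # Resample the audited subset with replacement and record its accuracy.
        sample = [labels_correct[rng.randrange(n)] for _ in range(n)]
        accs.append(sum(sample) / n)
    accs.sort()
    lo = accs[int((alpha / 2) * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# 95 of 100 expert-audited labels were judged correct.
audited = [True] * 95 + [False] * 5
lo, hi = bootstrap_accuracy_ci(audited)
# Accept the annotation batch only if the lower bound clears the threshold.
print(lo >= 0.90)
```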

Results & Findings

| Model | F1 (5Ws) on EVENT5Ws | Transfer F1 on External Set |
| --- | --- | --- |
| BERT‑base | 62.4% | 55.1% |
| RoBERTa‑large | 68.9% | 60.3% |
| T5‑base (seq2seq) | 71.2% | 63.7% |
| GPT‑3.5 (few‑shot) | 74.5% | 66.8% |
  • Higher coverage matters: Models trained on EVENT5Ws outperform the same architectures trained on older, closed‑domain datasets by 8–12 percentage points on both in‑domain and out‑of‑domain tests.
  • Few‑shot prompting works: Even without fine‑tuning, GPT‑3.5 achieves competitive scores, highlighting the dataset’s usefulness for prompt engineering research.
  • Annotation complexity: The “why” component proved hardest (average annotator agreement 0.62), confirming the need for richer context when extracting motivations.
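The few-shot setting reported above can be reproduced with a prompt template like the one below. The paper does not include its prompt wording, so the instruction text and the in-context example are illustrative assumptions.

```python
# Sketch of a one-shot 5Ws extraction prompt; wording is illustrative.
FEW_SHOT_PROMPT = """\
Extract the who, what, when, where, and why of the main event.

Document: The city council approved the flood-relief budget on Monday \
after weeks of heavy rain damaged the riverside district.
Who: the city council
What: approved the flood-relief budget
When: on Monday
Where: the riverside district
Why: weeks of heavy rain damaged the district

Document: {document}
Who:"""

def build_prompt(document: str) -> str:
    """Fill the template; the model completes the remaining 5W fields."""
    return FEW_SHOT_PROMPT.format(document=document)

prompt = build_prompt("Rescuers evacuated 200 residents on Tuesday after the levee failed.")
print(prompt.endswith("Who:"))  # → True
```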

Practical Implications

  • Better crisis‑response tools – Developers building dashboards for emergency management can now train models that reliably pull out who acted, what happened, when, where, and why, directly from incident reports.
  • Automated knowledge‑graph construction – EVENT5Ws provides the raw material for populating event‑centric KG triples, enabling downstream applications like timeline generation or recommendation engines.
  • Prompt‑engineering datasets – The 5Ws format aligns naturally with instruction‑following LLMs, making the dataset a ready‑made benchmark for evaluating prompt designs.
  • Cross‑regional deployment – Because models fine‑tuned on EVENT5Ws generalize across geographies, companies can roll out a single extraction service for multilingual news feeds with minimal re‑training.
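The knowledge-graph use case maps naturally onto event-centric triples: each 5W answer becomes one edge from an event node. The predicate names below (hasAgent, hasTime, etc.) are assumptions for illustration, not a vocabulary defined by the paper.

```python
# Sketch: turning one extracted 5Ws record into event-centric KG triples.
def event_to_triples(event_id: str, fivews: dict[str, str]) -> list[tuple[str, str, str]]:
    predicate = {
        "who": "hasAgent",
        "what": "hasAction",
        "when": "hasTime",
        "where": "hasLocation",
        "why": "hasCause",
    }
    # One (event, predicate, value) edge per answered 5W component.
    return [(event_id, predicate[k], v) for k, v in fivews.items() if k in predicate and v]

triples = event_to_triples("event:42", {
    "who": "city council",
    "what": "approved flood-relief budget",
    "when": "Monday",
    "where": "riverside district",
    "why": "heavy rain damage",
})
print(len(triples))  # → 5
```

Downstream applications such as timeline generation can then sort events by the object of their hasTime edge.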

Limitations & Future Work

  • Domain bias – Although the source collection is diverse, it leans heavily toward English‑language news; low‑resource languages remain under‑represented.
  • Granularity of “why” – The authors note that causal reasoning often requires external world knowledge, which the current annotations do not capture.
  • Scalability of manual verification – Even with crowdsourcing, the verification step is costly; future work could explore semi‑automated quality checks using model‑in‑the‑loop methods.
  • Temporal dynamics – The dataset treats each document as static; extending it to handle event evolution over time (e.g., updates to a breaking story) is an open research direction.

EVENT5Ws is poised to become a cornerstone resource for anyone building real‑world event‑extraction pipelines, from AI‑powered newsroom tools to emergency‑response analytics platforms. By lowering the data barrier and offering a clear roadmap for large‑scale annotation, the paper paves the way for more robust, generalizable NLP systems that understand the “who, what, when, where, and why” of the world around us.

Authors

  • Praval Sharma
  • Ashok Samal
  • Leen‑Kiat Soh
  • Deepti Joshi

Paper Information

  • arXiv ID: 2604.21890v1
  • Categories: cs.CL
  • Published: April 23, 2026