[Paper] Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)

Published: March 2, 2026
Source: arXiv - 2603.02150v1

Overview

The paper introduces CrimeNER, a new benchmark for zero‑ and few‑shot Named‑Entity Recognition (NER) focused on crime‑related texts. By releasing a curated dataset of more than 1,500 annotated documents drawn from terrorist‑attack reports and U.S. DOJ press releases, the authors fill a gap in publicly available, high‑quality crime‑domain annotations and show how modern NER models perform when only a handful of labeled examples is available.

Key Contributions

  • CrimeNERdb: a publicly released corpus of 1,543 documents annotated with 5 coarse‑grained and 22 fine‑grained crime entity types.
  • Zero‑ and Few‑Shot Evaluation Protocol: systematic experiments that measure how well state‑of‑the‑art NER models and large language models (LLMs) generalize to the crime domain with 0, 1, 5, and 10 labeled examples per class.
  • Benchmark Results: comprehensive performance tables for token‑level models (e.g., BERT‑CRF, SpanBERT) and prompting‑based LLMs (e.g., GPT‑3.5, LLaMA‑2), highlighting the gap between fully supervised and low‑resource settings.
  • Error‑Analysis Toolkit: qualitative analysis of common failure modes (e.g., entity boundary ambiguity, domain‑specific terminology) that can guide future model improvements.
  • Open‑Source Release: dataset, annotation guidelines, and evaluation scripts are made available under an open license, encouraging reproducibility and community contributions.

Methodology

  1. Data Collection & Annotation

    • Sources: public terrorism incident reports (e.g., the Global Terrorism Database) and U.S. Department of Justice press releases.
    • Annotation schema: 5 high‑level categories (PERPETRATOR, VICTIM, LOCATION, WEAPON, CRIME_TYPE) and 22 detailed sub‑types (e.g., GUN_TYPE, FINANCIAL_MOTIVE).
    • Quality control: double‑annotation with adjudication, achieving a Cohen’s κ of 0.84 for coarse labels.
  2. Zero‑/Few‑Shot Setup

    • Zero‑Shot: models receive only the label definitions (no training examples).
    • Few‑Shot: models are fine‑tuned or prompted with 1, 5, or 10 randomly sampled annotated sentences per entity type.
    • Baselines: classic CRF, BERT‑based token classifiers, and recent span‑based architectures.
  3. LLM Prompting

    • Structured prompts that list entity types and ask the model to label a given sentence.
    • Experiments with both zero‑shot (no examples) and few‑shot (in‑context examples) for GPT‑3.5‑Turbo, Claude‑2, and LLaMA‑2‑13B.
  4. Evaluation

    • Standard NER metrics (precision, recall, F1) computed at both coarse and fine granularity.
    • Statistical significance testing (bootstrap) to compare models across shot levels.
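The structured prompting setup described in step 3 can be sketched as plain prompt construction plus a parser for the model's reply. This is an illustrative reconstruction, not the paper's exact prompt: the function names (`build_ner_prompt`, `parse_response`), the output format, and the use of only the five coarse types are assumptions.

```python
# Hypothetical sketch of the structured zero-/few-shot NER prompt:
# list the entity types, optionally prepend in-context examples,
# then ask the model to tag one sentence.

ENTITY_TYPES = ["PERPETRATOR", "VICTIM", "LOCATION", "WEAPON", "CRIME_TYPE"]

def build_ner_prompt(sentence, examples=()):
    """Build a zero-shot (no examples) or few-shot (with examples) prompt."""
    lines = [
        "Label every entity in the sentence with one of these types: "
        + ", ".join(ENTITY_TYPES) + ".",
        "Answer as lines of the form: <entity text> -> <TYPE>.",
    ]
    for ex_sentence, ex_labels in examples:  # few-shot: in-context demos
        lines.append(f"Sentence: {ex_sentence}")
        lines.extend(f"{text} -> {etype}" for text, etype in ex_labels)
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)

def parse_response(text):
    """Parse '<entity> -> <TYPE>' lines back into (text, type) pairs."""
    pairs = []
    for line in text.splitlines():
        if "->" in line:
            entity, _, etype = line.partition("->")
            etype = etype.strip()
            if etype in ENTITY_TYPES:  # drop hallucinated labels
                pairs.append((entity.strip(), etype))
    return pairs

# Zero-shot: examples=(); few-shot: pass (sentence, labels) demonstrations.
prompt = build_ner_prompt(
    "The suspect fled the bank in downtown Chicago.",
    examples=[("John robbed a store.", [("John", "PERPETRATOR")])],
)
```

The actual call to GPT‑3.5‑Turbo, Claude‑2, or LLaMA‑2 is omitted; any chat API that takes the prompt string and returns text would slot in between the two functions.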
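The bootstrap significance test in step 4 can be sketched as a paired resampling over per-document F1 scores. The exact resampling scheme in the paper is not specified, so this is one common variant under assumed inputs (equal-length score lists, one score per document):

```python
import random

def bootstrap_f1_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: resample documents with replacement and count
    how often model A's mean F1 exceeds model B's. Returns an estimated
    p-value for the hypothesis that A is not better than B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap sample
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff > 0:
            wins += 1
    return 1.0 - wins / n_resamples
```

A small p-value (e.g., below 0.05) indicates the gap between two shot levels or models is unlikely to be a sampling artifact.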

Results & Findings

| Model | Shots | Coarse F1 | Fine F1 |
| --- | --- | --- | --- |
| BERT‑CRF (fully supervised) | 100% of training data | 92.1 | 84.3 |
| SpanBERT (few‑shot) | 10 | 78.4 | 62.7 |
| GPT‑3.5‑Turbo (zero‑shot) | 0 | 61.2 | 48.5 |
| GPT‑3.5‑Turbo (5‑shot) | 5 | 73.9 | 58.1 |
| LLaMA‑2‑13B (10‑shot) | 10 | 71.5 | 55.4 |
  • Performance Gap: Even the strongest LLMs lag behind a fully supervised BERT‑CRF by ~15–20 F1 points, confirming the difficulty of the crime domain.
  • Few‑Shot Gains: Adding just 5–10 examples yields a 10–12 point jump in F1 for LLMs, showing that in‑context learning is highly effective when the prompt is well‑crafted.
  • Fine‑Grained Challenge: All models struggle more on the 22 sub‑types, especially rare entities like FINANCIAL_MOTIVE or CYBER_WEAPON.
  • Error Patterns: Mis‑labeling of multi‑word entities (e.g., “armed robbery” split into CRIME_TYPE + WEAPON) and confusion between PERPETRATOR vs. ACCOMPLICE are the most common mistakes.
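The boundary errors above are costly because standard NER evaluation is typically strict: a prediction counts only if both the span and the type match exactly. A minimal sketch, assuming strict exact-match scoring over `(start, end, type)` spans (the paper's precise matching scheme is not stated):

```python
def span_f1(gold, pred):
    """Strict entity-level F1: a predicted entity is correct only if
    its boundaries AND type exactly match a gold entity.
    Entities are (start, end, type) tuples over token offsets."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Gold: "armed robbery" as one CRIME_TYPE span (tokens 2-4).
gold = [(2, 4, "CRIME_TYPE")]
# The error pattern above: the span is split into two wrong entities,
# yielding zero exact matches, so F1 collapses to 0 for this sentence.
pred = [(2, 3, "CRIME_TYPE"), (3, 4, "WEAPON")]
```

Under this metric a single multi-word split produces both a false positive and a false negative, which is why boundary ambiguity dominates the error analysis.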

Practical Implications

  • Law‑Enforcement Automation: CrimeNER can be plugged into pipelines that ingest incident reports, automatically extracting suspects, victims, and weapon details for faster case triage.
  • Threat‑Intelligence Platforms: Security analysts can use few‑shot fine‑tuned LLMs to parse open‑source intelligence (OSINT) feeds without the need for costly annotation campaigns.
  • Compliance & Auditing: Companies handling legal documents (e.g., compliance reports) can leverage the dataset to train domain‑specific NER models that flag criminal‑related clauses.
  • Rapid Prototyping: The few‑shot benchmarks demonstrate that a developer can achieve usable performance with as few as 5 annotated sentences—making PoC development feasible for startups and NGOs.
  • Cross‑Domain Transfer: Insights from CrimeNER can inform low‑resource NER in other high‑stakes domains (e.g., medical adverse events, financial fraud) where annotated data is scarce.

Limitations & Future Work

  • Domain Coverage: The corpus focuses on U.S. DOJ releases and terrorism reports; it may not capture nuances of organized crime, cybercrime, or non‑English contexts.
  • Class Imbalance: Some fine‑grained entities appear in fewer than 20 instances, limiting the reliability of few‑shot results for those types.
  • Prompt Sensitivity: LLM performance varies significantly with prompt phrasing; the study does not exhaustively explore prompt engineering strategies.

Future Directions

  • Expanding the dataset to include multilingual crime reports and court transcripts.
  • Investigating adapter‑based or parameter‑efficient fine‑tuning to improve few‑shot performance without full model retraining.
  • Developing hierarchical NER models that first predict coarse categories then refine to fine‑grained types, reducing error propagation.
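The hierarchical idea in the last bullet can be sketched as a two-stage lookup: predict the coarse class first, then refine only among fine types consistent with it. The fine-to-coarse mapping below is hypothetical (only GUN_TYPE and FINANCIAL_MOTIVE appear in the paper; the rest, and the function names, are illustrative):

```python
# Hypothetical fine-to-coarse mapping for a two-stage NER pipeline.
FINE_TO_COARSE = {
    "GUN_TYPE": "WEAPON",
    "KNIFE_TYPE": "WEAPON",
    "FINANCIAL_MOTIVE": "CRIME_TYPE",
}

def refine(span_text, coarse_label, fine_classifier):
    """Stage 2: refine a coarse prediction, restricted to fine types
    consistent with it. Inconsistent fine labels can never be emitted,
    which is how the hierarchy limits error propagation."""
    candidates = [f for f, c in FINE_TO_COARSE.items() if c == coarse_label]
    if not candidates:
        return coarse_label  # no fine types under this coarse class
    return fine_classifier(span_text, candidates)

# A trivial stand-in classifier that picks the first candidate;
# a real system would score candidates with a trained model.
label = refine("AK-47", "WEAPON", lambda text, cands: cands[0])
```

The design choice is that a stage-2 mistake can at worst pick the wrong sibling under the correct coarse class, rather than an unrelated type.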

CrimeNER opens the door for practical, low‑resource NER in a high‑impact domain. By making the data and evaluation framework openly available, the authors invite the community to build the next generation of intelligent tools for public safety and legal analytics.

Authors

  • Miguel Lopez-Duran
  • Julian Fierrez
  • Aythami Morales
  • Daniel DeAlcala
  • Gonzalo Mancera
  • Javier Irigoyen
  • Ruben Tolosana
  • Oscar Delgado
  • Francisco Jurado
  • Alvaro Ortigosa

Paper Information

  • arXiv ID: 2603.02150v1
  • Categories: cs.CL, cs.AI, cs.DB
  • Published: March 2, 2026