[Paper] SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
Source: arXiv - 2604.16262v1
Overview
The paper presents SwanNLP, a framework that leverages large language models (LLMs) to score how plausible a particular word sense is within a short narrative. By tackling SemEval‑2026 Task 5—plausibility scoring for narrative word‑sense disambiguation—the authors show that modern LLMs can mimic human judgments about which meaning of an ambiguous word “fits” a story context.
Key Contributions
- LLM‑based plausibility scorer that combines structured reasoning with either fine‑tuned small models or dynamic few‑shot prompting of large commercial models.
- Empirical comparison of low‑parameter fine‑tuned models vs. high‑parameter few‑shot prompting, revealing that the latter matches human plausibility ratings most closely.
- Ensemble strategy that aggregates predictions from multiple LLMs, modestly improving alignment with the consensus of five human annotators.
- Comprehensive analysis of reasoning strategies (e.g., chain‑of‑thought, contrastive prompting) and their effect on sense identification accuracy.
Methodology
- Task formulation – Each instance consists of a short story, a target homonymous word, and two candidate senses. The system must output a plausibility score (0–1) reflecting how likely a human would pick each sense.
- Model families
- Fine‑tuned low‑parameter LLMs (≈ 300 M–1 B parameters) trained on a curated set of sense‑disambiguation examples with explicit reasoning prompts.
- Dynamic few‑shot prompting of large commercial LLMs (≈ 10 B–175 B parameters) where the prompt is built on‑the‑fly from the most similar training examples.
- Structured reasoning – Both approaches prepend a “reasoning template” that forces the model to (a) restate the story, (b) list possible senses, (c) compare contextual clues, and (d) output a confidence score. This chain‑of‑thought style improves interpretability and consistency.
- Ensembling – Predictions from three diverse models (one fine‑tuned, two few‑shot) are averaged, and a simple calibration step aligns the ensemble output with the distribution of human annotator scores.
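The dynamic few-shot step above can be sketched in a few lines. The reasoning template wording, the bag-of-words similarity, and the prompt layout below are illustrative assumptions standing in for the paper's retrieval system and exact template, not the authors' implementation:

```python
# Sketch: build a few-shot prompt on-the-fly from the training examples
# most similar to the query story, prefixed with a reasoning template.
from collections import Counter
from math import sqrt

REASONING_TEMPLATE = (
    "1. Restate the story in one sentence.\n"
    "2. List the candidate senses of the target word.\n"
    "3. Compare contextual clues for each sense.\n"
    "4. Output a plausibility score in [0, 1] for each sense.\n"
)

def _bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real sentence embedder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(story: str, word: str, senses: list[str],
                 pool: list[dict], k: int = 2) -> str:
    """Select the k most similar training examples and assemble the prompt."""
    ranked = sorted(pool, key=lambda ex: cosine(_bow(ex["story"]), _bow(story)),
                    reverse=True)
    shots = "\n".join(
        f"Story: {ex['story']}\nWord: {ex['word']}\nScores: {ex['scores']}"
        for ex in ranked[:k]
    )
    return (f"{REASONING_TEMPLATE}\n{shots}\n\n"
            f"Story: {story}\nWord: {word}\nSenses: {senses}\nScores:")
```

In practice the bag-of-words similarity would be replaced by a dense retriever, but the control flow (embed, rank, take top-k, prepend template) is the same.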
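The ensembling step can likewise be sketched minimally. The paper only says that predictions are averaged and calibrated to the human score distribution; matching the mean and spread of the human scores, as below, is one simple assumption about what that calibration could look like:

```python
# Sketch: average per-instance scores from three models, then shift and
# scale so the ensemble matches the human score distribution.
from statistics import mean, pstdev

def ensemble(scores_per_model: list[list[float]]) -> list[float]:
    """Average plausibility scores across models, instance by instance."""
    return [mean(cols) for cols in zip(*scores_per_model)]

def calibrate(preds: list[float], human: list[float]) -> list[float]:
    """Match the mean and standard deviation of the human scores,
    clipping the result back into the valid [0, 1] range."""
    mu_p, sd_p = mean(preds), pstdev(preds)
    mu_h, sd_h = mean(human), pstdev(human)
    scale = sd_h / sd_p if sd_p else 1.0
    return [min(1.0, max(0.0, mu_h + (p - mu_p) * scale)) for p in preds]
```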
Results & Findings
| Model type | Plausibility‑F1 (average) | Sense‑Acc (top‑1) |
|---|---|---|
| Fine‑tuned small LLM | 0.71 | 0.84 |
| Large LLM + dynamic few‑shot | 0.78 | 0.89 |
| Ensemble (3 models) | 0.80 | 0.91 |
- Large LLMs with dynamic few‑shot prompting achieve the highest correlation with human plausibility judgments, surpassing fine‑tuned smaller models by about 7 F1 points (0.71 → 0.78).
- Ensembling yields a small but consistent boost, especially in cases where human annotators disagreed, indicating the ensemble better captures the “majority opinion”.
- The structured reasoning prompt reduces variance across runs and makes the model’s decision process more transparent.
Practical Implications
- Narrative‑aware applications – Chatbots, interactive fiction engines, and AI‑assisted writing tools can use the plausibility scorer to select word senses that keep stories coherent and natural‑sounding.
- Content moderation & bias detection – By flagging implausible sense usages, platforms can spot awkward or potentially misleading language in user‑generated narratives.
- Low‑resource adaptation – The fine‑tuning pipeline shows that even modest‑size models can be deployed on‑device (e.g., mobile writing assistants) with acceptable performance, while the few‑shot approach offers a plug‑and‑play API for cloud‑based services.
- Explainable AI – The chain‑of‑thought output supplies a human‑readable rationale, useful for debugging or for compliance where model decisions need justification (e.g., educational software that explains word‑choice suggestions).
Limitations & Future Work
- Domain coverage – The training and evaluation data focus on short literary excerpts; performance on technical prose, dialogues, or multilingual narratives remains untested.
- Prompt engineering overhead – Dynamic few‑shot prompting requires a retrieval system to fetch relevant exemplars, adding latency for real‑time services.
- Human agreement ceiling – Even the best model cannot exceed the inherent variability among annotators; future work could explore modeling individual annotator profiles or incorporating external knowledge bases (e.g., WordNet) to tighten the plausibility gap.
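One of the future directions above, modeling individual annotator profiles, can be sketched simply. The idea of learning a per-annotator offset from the five-annotator consensus is an illustration of the suggestion, with made-up names and numbers, not a method from the paper:

```python
# Sketch: estimate each annotator's bias as their mean signed deviation
# from the item-wise consensus, then personalize a consensus prediction.
from statistics import mean

def annotator_offsets(ratings: dict[str, list[float]]) -> dict[str, float]:
    """Per-annotator bias relative to the mean rating of all annotators."""
    names = list(ratings)
    n_items = len(ratings[names[0]])
    consensus = [mean(ratings[a][i] for a in names) for i in range(n_items)]
    return {a: mean(ratings[a][i] - consensus[i] for i in range(n_items))
            for a in names}

def personalize(pred: float, offset: float) -> float:
    """Shift a consensus-level prediction toward one annotator's profile."""
    return min(1.0, max(0.0, pred + offset))
```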
Bottom line: SwanNLP demonstrates that with the right prompting and reasoning scaffolds, today’s LLMs can reliably gauge how “natural” a word sense feels in a story—opening the door to smarter, context‑aware language tools for developers and content creators alike.
Authors
- Deshan Sumanathilaka
- Nicholas Micallef
- Julian Hough
- Saman Jayasinghe
Paper Information
- arXiv ID: 2604.16262v1
- Categories: cs.CL
- Published: April 17, 2026