[Paper] SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
Source: arXiv - 2604.16262v1
Overview
The paper presents SwanNLP, a framework that leverages large language models (LLMs) to score how plausible a particular word sense is within a short narrative. By tackling SemEval‑2026 Task 5—plausibility scoring for narrative word‑sense disambiguation—the authors show that modern LLMs can mimic human judgments about which meaning of an ambiguous word “fits” a story context.
Key Contributions
- LLM‑based plausibility scorer that combines structured reasoning with either fine‑tuned small models or dynamic few‑shot prompting of large commercial models.
- Empirical comparison of low‑parameter fine‑tuned models vs. high‑parameter few‑shot prompting, revealing that the latter matches human plausibility ratings most closely.
- Ensemble strategy that aggregates predictions from multiple LLMs, modestly improving alignment with the consensus of five human annotators.
- Comprehensive analysis of reasoning strategies (e.g., chain‑of‑thought, contrastive prompting) and their effect on sense identification accuracy.
Methodology
- Task formulation – Each instance consists of a short story, a target homonymous word, and two candidate senses. The system must output a plausibility score (0–1) reflecting how likely a human would pick each sense.
- Model families
- Fine‑tuned low‑parameter LLMs (≈ 300 M–1 B parameters) trained on a curated set of sense‑disambiguation examples with explicit reasoning prompts.
- Dynamic few‑shot prompting of large commercial LLMs (≈ 10 B–175 B parameters) where the prompt is built on‑the‑fly from the most similar training examples.
- Structured reasoning – Both approaches prepend a “reasoning template” that forces the model to (a) restate the story, (b) list possible senses, (c) compare contextual clues, and (d) output a confidence score. This chain‑of‑thought style improves interpretability and consistency.
- Ensembling – Predictions from three diverse models (one fine‑tuned, two few‑shot) are averaged, and a simple calibration step aligns the ensemble output with the distribution of human annotator scores.
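The dynamic few-shot step above can be sketched in a few lines. The reasoning template wording, the bag-of-words similarity, and the prompt layout below are illustrative assumptions standing in for the paper's retrieval system and exact template, not the authors' implementation:

```python
# Sketch: build a few-shot prompt on-the-fly from the training examples
# most similar to the query story, prefixed with a reasoning template.
from collections import Counter
from math import sqrt

REASONING_TEMPLATE = (
    "1. Restate the story in one sentence.\n"
    "2. List the candidate senses of the target word.\n"
    "3. Compare contextual clues for each sense.\n"
    "4. Output a plausibility score in [0, 1] for each sense.\n"
)

def _bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real sentence embedder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(story: str, word: str, senses: list[str],
                 pool: list[dict], k: int = 2) -> str:
    """Select the k most similar training examples and assemble the prompt."""
    ranked = sorted(pool, key=lambda ex: cosine(_bow(ex["story"]), _bow(story)),
                    reverse=True)
    shots = "\n".join(
        f"Story: {ex['story']}\nWord: {ex['word']}\nScores: {ex['scores']}"
        for ex in ranked[:k]
    )
    return (f"{REASONING_TEMPLATE}\n{shots}\n\n"
            f"Story: {story}\nWord: {word}\nSenses: {senses}\nScores:")
```

In practice the bag-of-words similarity would be replaced by a dense retriever, but the control flow (embed, rank, take top-k, prepend template) is the same.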
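The ensembling step can likewise be sketched minimally. The paper only says that predictions are averaged and calibrated to the human score distribution; matching the mean and spread of the human scores, as below, is one simple assumption about what that calibration could look like:

```python
# Sketch: average per-instance scores from three models, then shift and
# scale so the ensemble matches the human score distribution.
from statistics import mean, pstdev

def ensemble(scores_per_model: list[list[float]]) -> list[float]:
    """Average plausibility scores across models, instance by instance."""
    return [mean(cols) for cols in zip(*scores_per_model)]

def calibrate(preds: list[float], human: list[float]) -> list[float]:
    """Match the mean and standard deviation of the human scores,
    clipping the result back into the valid [0, 1] range."""
    mu_p, sd_p = mean(preds), pstdev(preds)
    mu_h, sd_h = mean(human), pstdev(human)
    scale = sd_h / sd_p if sd_p else 1.0
    return [min(1.0, max(0.0, mu_h + (p - mu_p) * scale)) for p in preds]
```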
Results & Findings
| Model type | Plausibility‑F1 (average) | Sense‑Acc (top‑1) |
|---|---|---|
| Fine‑tuned small LLM | 0.71 | 0.84 |
| Large LLM + dynamic few‑shot | 0.78 | 0.89 |
| Ensemble (3 models) | 0.80 | 0.91 |
- Large LLMs with dynamic few‑shot prompting achieve the highest correlation with human plausibility judgments, surpassing fine‑tuned smaller models by about 7 F1 points (0.71 → 0.78).
- Ensembling yields a small but consistent boost, especially in cases where human annotators disagreed, indicating the ensemble better captures the “majority opinion”.
- The structured reasoning prompt reduces variance across runs and makes the model’s decision process more transparent.
Practical Implications
- Narrative‑aware applications – Chatbots, interactive fiction engines, and AI‑assisted writing tools can use the plausibility scorer to select word senses that keep stories coherent and natural‑sounding.
- Content moderation & bias detection – By flagging implausible sense usages, platforms can spot awkward or potentially misleading language in user‑generated narratives.
- Low‑resource adaptation – The fine‑tuning pipeline shows that even modest‑size models can be deployed on‑device (e.g., mobile writing assistants) with acceptable performance, while the few‑shot approach offers a plug‑and‑play API for cloud‑based services.
- Explainable AI – The chain‑of‑thought output supplies a human‑readable rationale, useful for debugging or for compliance where model decisions need justification (e.g., educational software that explains word‑choice suggestions).
Limitations & Future Work
- Domain coverage – The training and evaluation data focus on short literary excerpts; performance on technical prose, dialogues, or multilingual narratives remains untested.
- Prompt engineering overhead – Dynamic few‑shot prompting requires a retrieval system to fetch relevant exemplars, adding latency for real‑time services.
- Human agreement ceiling – Even the best model cannot exceed the inherent variability among annotators; future work could explore modeling individual annotator profiles or incorporating external knowledge bases (e.g., WordNet) to tighten the plausibility gap.
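One of the future directions above, modeling individual annotator profiles, can be sketched simply. The idea of learning a per-annotator offset from the five-annotator consensus is an illustration of the suggestion, with made-up names and numbers, not a method from the paper:

```python
# Sketch: estimate each annotator's bias as their mean signed deviation
# from the item-wise consensus, then personalize a consensus prediction.
from statistics import mean

def annotator_offsets(ratings: dict[str, list[float]]) -> dict[str, float]:
    """Per-annotator bias relative to the mean rating of all annotators."""
    names = list(ratings)
    n_items = len(ratings[names[0]])
    consensus = [mean(ratings[a][i] for a in names) for i in range(n_items)]
    return {a: mean(ratings[a][i] - consensus[i] for i in range(n_items))
            for a in names}

def personalize(pred: float, offset: float) -> float:
    """Shift a consensus-level prediction toward one annotator's profile."""
    return min(1.0, max(0.0, pred + offset))
```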
Bottom line: SwanNLP demonstrates that with the right prompting and reasoning scaffolds, today’s LLMs can reliably gauge how “natural” a word sense feels in a story—opening the door to smarter, context‑aware language tools for developers and content creators alike.
Authors
- Deshan Sumanathilaka
- Nicholas Micallef
- Julian Hough
- Saman Jayasinghe
Paper Information
- arXiv ID: 2604.16262v1
- Categories: cs.CL
- Published: April 17, 2026