[Paper] SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation

Published: April 17, 2026 at 01:18 PM EDT
4 min read
Source: arXiv - 2604.16262v1

Overview

The paper presents SwanNLP, a framework that leverages large language models (LLMs) to score how plausible a particular word sense is within a short narrative. By tackling SemEval‑2026 Task 5—plausibility scoring for narrative word‑sense disambiguation—the authors show that modern LLMs can mimic human judgments about which meaning of an ambiguous word “fits” a story context.

Key Contributions

  • LLM‑based plausibility scorer that combines structured reasoning with either fine‑tuned small models or dynamic few‑shot prompting of large commercial models.
  • Empirical comparison of low‑parameter fine‑tuned models vs. high‑parameter few‑shot prompting, revealing that the latter matches human plausibility ratings most closely.
  • Ensemble strategy that aggregates predictions from multiple LLMs, modestly improving alignment with the consensus of five human annotators.
  • Comprehensive analysis of reasoning strategies (e.g., chain‑of‑thought, contrastive prompting) and their effect on sense identification accuracy.
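The dynamic few-shot prompting mentioned above builds each prompt on the fly from the training examples most similar to the query. The summary does not say how similarity is computed, so the sketch below stands in a simple bag-of-words cosine similarity; the function names and prompt layout are illustrative, not the authors' exact implementation.

```python
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_exemplars(story: str, train_set: list[dict], k: int = 3) -> list[dict]:
    """Pick the k training examples whose stories are most similar to the query story."""
    return sorted(train_set, key=lambda ex: cosine_sim(story, ex["story"]), reverse=True)[:k]

def build_prompt(story: str, word: str, senses: list[str], exemplars: list[dict]) -> str:
    """Assemble a few-shot prompt from retrieved exemplars plus the query instance."""
    shots = "\n\n".join(
        f"Story: {ex['story']}\nWord: {ex['word']}\nSenses: {ex['senses']}\n"
        f"Plausibility scores: {ex['scores']}"
        for ex in exemplars
    )
    return (f"{shots}\n\nStory: {story}\nWord: {word}\n"
            f"Senses: {senses}\nPlausibility scores:")
```

In a production system the bag-of-words similarity would likely be replaced by dense embeddings, but the retrieve-then-prompt flow stays the same.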

Methodology

  1. Task formulation – Each instance consists of a short story, a target homonymous word, and two candidate senses. The system must output a plausibility score (0–1) reflecting how likely a human would pick each sense.
  2. Model families
    • Fine‑tuned low‑parameter LLMs (≈ 300 M–1 B parameters) trained on a curated set of sense‑disambiguation examples with explicit reasoning prompts.
    • Dynamic few‑shot prompting of large commercial LLMs (≈ 10 B–175 B parameters) where the prompt is built on‑the‑fly from the most similar training examples.
  3. Structured reasoning – Both approaches prepend a “reasoning template” that forces the model to (a) restate the story, (b) list possible senses, (c) compare contextual clues, and (d) output a confidence score. This chain‑of‑thought style improves interpretability and consistency.
  4. Ensembling – Predictions from three diverse models (one fine‑tuned, two few‑shot) are averaged, and a simple calibration step aligns the ensemble output with the distribution of human annotator scores.
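The reasoning template and ensembling steps above can be sketched as follows. The template wording and the mean/variance-matching calibration are assumptions made for illustration; the paper describes only a "reasoning template" and a "simple calibration step," not these exact choices.

```python
import statistics

# Hypothetical four-step reasoning template mirroring steps (a)-(d) above.
REASONING_TEMPLATE = """\
1. Restate the story in one sentence.
2. List the candidate senses of "{word}".
3. Compare the contextual clues supporting each sense.
4. Output a plausibility score in [0, 1] for each sense.

Story: {story}
Word: {word}
Senses: {senses}
"""

def build_reasoning_prompt(story: str, word: str, senses: list[str]) -> str:
    return REASONING_TEMPLATE.format(story=story, word=word, senses=senses)

def ensemble(scores_per_model: list[list[float]]) -> list[float]:
    """Average per-sense plausibility scores across the ensemble's models."""
    n = len(scores_per_model)
    return [sum(m[i] for m in scores_per_model) / n
            for i in range(len(scores_per_model[0]))]

def calibrate(preds: list[float], human_scores: list[float]) -> list[float]:
    """Shift and scale predictions so their mean/stdev match the human distribution,
    clamping the result back into [0, 1]."""
    mp, sp = statistics.mean(preds), statistics.pstdev(preds)
    mh, sh = statistics.mean(human_scores), statistics.pstdev(human_scores)
    if sp == 0:
        return [min(1.0, max(0.0, mh)) for _ in preds]
    return [min(1.0, max(0.0, mh + (p - mp) * sh / sp)) for p in preds]
```

For example, averaging three models' scores for two senses and then rescaling against a held-out set of annotator scores yields the final submitted plausibility values.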

Results & Findings

Model type                      Plausibility-F1 (average)   Sense-Acc (top-1)
Fine-tuned small LLM            0.71                        0.84
Large LLM + dynamic few-shot    0.78                        0.89
Ensemble (3 models)             0.80                        0.91
  • Large LLMs with dynamic few‑shot prompting achieve the highest correlation with human plausibility judgments, surpassing fine‑tuned smaller models by roughly 7 F1 points (0.71 → 0.78).
  • Ensembling yields a small but consistent boost, especially in cases where human annotators disagreed, indicating the ensemble better captures the “majority opinion”.
  • The structured reasoning prompt reduces variance across runs and makes the model’s decision process more transparent.
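The summary does not spell out how the two metrics in the table are computed, so the sketch below illustrates one plausible reading: sense accuracy as top-1 argmax agreement with the gold sense, and plausibility-F1 as F1 over scores binarized at 0.5. Both definitions are assumptions for illustration.

```python
def sense_accuracy(pred_scores: list[list[float]], gold_senses: list[int]) -> float:
    """Top-1 accuracy: the highest-scored sense must match the gold sense index."""
    correct = sum(1 for scores, gold in zip(pred_scores, gold_senses)
                  if scores.index(max(scores)) == gold)
    return correct / len(gold_senses)

def binarized_f1(pred: list[float], gold: list[float], threshold: float = 0.5) -> float:
    """F1 over per-sense plausibility labels binarized at a threshold."""
    pb = [int(p >= threshold) for p in pred]
    gb = [int(g >= threshold) for g in gold]
    tp = sum(1 for p, g in zip(pb, gb) if p and g)
    fp = sum(1 for p, g in zip(pb, gb) if p and not g)
    fn = sum(1 for p, g in zip(pb, gb) if not p and g)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```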

Practical Implications

  • Narrative‑aware applications – Chatbots, interactive fiction engines, and AI‑assisted writing tools can use the plausibility scorer to select word senses that keep stories coherent and natural‑sounding.
  • Content moderation & bias detection – By flagging implausible sense usages, platforms can spot awkward or potentially misleading language in user‑generated narratives.
  • Low‑resource adaptation – The fine‑tuning pipeline shows that even modest‑size models can be deployed on‑device (e.g., mobile writing assistants) with acceptable performance, while the few‑shot approach offers a plug‑and‑play API for cloud‑based services.
  • Explainable AI – The chain‑of‑thought output supplies a human‑readable rationale, useful for debugging or for compliance where model decisions need justification (e.g., educational software that explains word‑choice suggestions).

Limitations & Future Work

  • Domain coverage – The training and evaluation data focus on short literary excerpts; performance on technical prose, dialogues, or multilingual narratives remains untested.
  • Prompt engineering overhead – Dynamic few‑shot prompting requires a retrieval system to fetch relevant exemplars, adding latency for real‑time services.
  • Human agreement ceiling – Even the best model cannot exceed the inherent variability among annotators; future work could explore modeling individual annotator profiles or incorporating external knowledge bases (e.g., WordNet) to tighten the plausibility gap.

Bottom line: SwanNLP demonstrates that with the right prompting and reasoning scaffolds, today’s LLMs can reliably gauge how “natural” a word sense feels in a story—opening the door to smarter, context‑aware language tools for developers and content creators alike.

Authors

  • Deshan Sumanathilaka
  • Nicholas Micallef
  • Julian Hough
  • Saman Jayasinghe

Paper Information

  • arXiv ID: 2604.16262v1
  • Categories: cs.CL
  • Published: April 17, 2026
  • PDF: Download PDF