[Paper] Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
Source: arXiv - 2601.00736v1
Overview
The paper investigates how modern large language models (LLMs) perform when asked to locate subjective text spans—the exact words that convey sentiment, offensiveness, or factual claims. While most prior work has used smaller models (e.g., BERT) for classic span‑tagging tasks like NER, this study is one of the first to systematically evaluate LLMs on more nuanced, opinion‑based span identification tasks.
Key Contributions
- Comprehensive benchmark across three real‑world tasks: aspect‑based sentiment analysis, offensive language detection, and claim verification.
- Systematic comparison of several LLM prompting strategies—plain zero‑shot, instruction‑tuned prompts, in‑context learning (few‑shot examples), and chain‑of‑thought (CoT) reasoning.
- Empirical evidence that underlying textual relationships (e.g., sentiment cues, discourse markers) help LLMs pinpoint spans more accurately than baseline methods.
- Open‑source evaluation scripts and a reproducible leaderboard for future research on subjective span identification.
Methodology
- Datasets – The authors selected publicly available corpora for each task:
  - Sentiment: SemEval‑ABSA datasets with aspect terms and polarity spans.
  - Offensive: OLID (Offensive Language Identification Dataset) with annotated offensive spans.
  - Claim verification: FEVER‑S with evidence sentence spans.
- LLM families – Experiments used several state‑of‑the‑art models (e.g., GPT‑3.5, Claude‑2, LLaMA‑2), accessed via API or as open‑source checkpoints.
- Prompt designs – Four prompting regimes were tested (a minimal prompt sketch follows this list):
  - Zero‑shot: a single instruction asking the model to “highlight the span that expresses the sentiment/offense/claim.”
  - Instruction‑tuned: a more detailed prompt that defines the span‑identification task and provides formatting guidelines.
  - In‑context learning: 2–3 exemplars showing input text, the target span, and the expected output format.
  - Chain‑of‑thought: the model first explains why a particular fragment is relevant before outputting the span.
- Evaluation metrics – Span‑level precision, recall, and F1 were computed using exact‑match and partial‑overlap criteria (similar to the standard “token‑level” evaluation in NER).
- Baselines – BERT‑based token classifiers trained on the same data served as strong, task‑specific baselines.
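To make the four prompting regimes concrete, here is a minimal sketch of how such prompts could be assembled. The wording, helper names, and output format below are illustrative assumptions, not the paper's exact templates.

```python
# Illustrative prompt builders for the four regimes; the wording is an
# assumption for demonstration, not the paper's exact templates.

ZERO_SHOT = (
    "Highlight the span in the text that expresses the {target}.\n"
    "Text: {text}\nSpan:"
)

INSTRUCTION_TUNED = (
    "Task: identify the minimal contiguous span that expresses the {target}.\n"
    "Copy the span verbatim from the text and return it on a single line.\n"
    "Text: {text}\nSpan:"
)

def few_shot_prompt(examples, text, target):
    """In-context learning: prepend 2-3 (text, span) exemplars before the query."""
    demos = "\n\n".join(
        f"Text: {ex_text}\nSpan: {ex_span}" for ex_text, ex_span in examples
    )
    return (
        f"Identify the span that expresses the {target}.\n\n"
        f"{demos}\n\nText: {text}\nSpan:"
    )

def cot_prompt(text, target):
    """Chain-of-thought: ask for a brief rationale before the final span."""
    return (
        f"Explain briefly which fragment of the text conveys the {target}, "
        f"then output that fragment on a final line starting with 'Span:'.\n"
        f"Text: {text}"
    )

# Example: zero-shot prompt for an aspect-based sentiment input.
print(ZERO_SHOT.format(target="sentiment", text="The battery life is fantastic."))
```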
Results & Findings
| Task | Best LLM Prompt | F1 (Exact) | F1 (Partial) | BERT Baseline |
|---|---|---|---|---|
| Sentiment (ABSA) | CoT + In‑context | 78.4 | 85.1 | 71.2 |
| Offensive | Instruction‑tuned | 74.9 | 82.3 | 68.7 |
| Claim verification | In‑context (3‑shot) | 71.5 | 79.0 | 66.4 |
- Chain‑of‑thought reasoning consistently boosted performance on the more nuanced sentiment task, suggesting that prompting the model to “think aloud” helps it resolve ambiguous cues.
- In‑context examples were especially valuable for claim verification, where the model needed to understand the logical relationship between premise and evidence.
- Across all tasks, LLMs outperformed the BERT baselines despite not being fine‑tuned on the specific datasets, highlighting the power of large‑scale pre‑training combined with smart prompting.
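To clarify the distinction between the exact and partial F1 columns above, the sketch below shows one common way to compute span‑level scores under exact‑match and token‑overlap criteria. This is an assumed formulation for illustration; the paper's precise matching and aggregation rules may differ.

```python
def span_scores(pred_spans, gold_spans):
    """Exact-match and partial (token-overlap) F1 for one example.

    Spans are (start, end) token offsets, end exclusive. This is an assumed
    formulation; the paper's exact overlap criterion may differ.
    """
    def f1(p, r):
        return 2 * p * r / (p + r) if (p + r) else 0.0

    # Exact match: a predicted span counts only if it is identical to a gold span.
    exact_hits = len(set(pred_spans) & set(gold_spans))
    p_exact = exact_hits / len(pred_spans) if pred_spans else 0.0
    r_exact = exact_hits / len(gold_spans) if gold_spans else 0.0

    # Partial match: score token-level overlap between predicted and gold tokens.
    pred_tokens = {t for s, e in pred_spans for t in range(s, e)}
    gold_tokens = {t for s, e in gold_spans for t in range(s, e)}
    overlap = len(pred_tokens & gold_tokens)
    p_part = overlap / len(pred_tokens) if pred_tokens else 0.0
    r_part = overlap / len(gold_tokens) if gold_tokens else 0.0

    return {"exact_f1": f1(p_exact, r_exact), "partial_f1": f1(p_part, r_part)}

# Example: the prediction overlaps the gold span but does not match it exactly.
print(span_scores(pred_spans=[(3, 6)], gold_spans=[(4, 6)]))
# -> exact_f1 = 0.0, partial_f1 = 0.8
```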
Practical Implications
- Explainable AI: Developers can leverage LLMs to generate human‑readable justifications (the highlighted span) for sentiment or moderation decisions, improving transparency in user‑facing applications.
- Rapid prototyping: Because the best results come from prompting alone, teams can build functional span‑extraction pipelines without costly annotation or fine‑tuning cycles.
- Content moderation: The offensive‑language findings suggest that LLMs can pinpoint the exact offending phrase, enabling more precise automated edits or warnings (a masking sketch follows this list).
- Fact‑checking tools: Accurate evidence‑span extraction can feed downstream verification engines, reducing the manual effort required to locate supporting sentences in large corpora.
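As a small illustration of the content‑moderation use case, the sketch below masks an identified offensive span before a message is displayed. The redaction helper and the example span are hypothetical; in practice the span would come from whichever LLM prompting pipeline is used.

```python
def redact_span(text: str, span: str, mask_char: str = "*") -> str:
    """Replace an identified offensive span with a mask of equal length.

    Assumes `span` is the exact substring returned by the span-identification
    step; if the model paraphrased instead of copying, the lookup fails and
    the text is returned unchanged.
    """
    start = text.find(span)
    if start == -1:
        return text
    return text[:start] + mask_char * len(span) + text[start + len(span):]

# Hypothetical usage with a span produced by the extraction step.
print(redact_span("That was a truly idiotic take.", "idiotic"))
# -> "That was a truly ******* take."
```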
Limitations & Future Work
- Prompt sensitivity – Performance varies noticeably with prompt wording; the paper notes that systematic prompt‑search is still an open problem.
- API constraints – Some LLMs were accessed via commercial APIs, limiting reproducibility for researchers without paid access.
- Domain coverage – Experiments focus on English news/social‑media data; cross‑lingual or domain‑specific (e.g., medical) span identification remains untested.
- Scalability – While zero‑shot prompting is cheap, in‑context learning with multiple examples can increase latency and token costs, which may be prohibitive for high‑throughput services.
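To put the token‑cost point in perspective, here is a back‑of‑envelope comparison; the prompt and exemplar lengths are assumed numbers for illustration, not figures from the paper.

```python
# Rough cost comparison between zero-shot and 3-shot prompting.
# All token counts below are assumptions for illustration.
ZERO_SHOT_PROMPT_TOKENS = 60   # instruction + input text (assumed)
TOKENS_PER_EXEMPLAR = 150      # exemplar text + span + formatting (assumed)
NUM_EXEMPLARS = 3
REQUESTS_PER_DAY = 100_000

few_shot_tokens = ZERO_SHOT_PROMPT_TOKENS + NUM_EXEMPLARS * TOKENS_PER_EXEMPLAR
extra_per_day = (few_shot_tokens - ZERO_SHOT_PROMPT_TOKENS) * REQUESTS_PER_DAY
print(f"{few_shot_tokens} prompt tokens per request "
      f"(vs. {ZERO_SHOT_PROMPT_TOKENS} zero-shot); "
      f"{extra_per_day:,} extra input tokens per day")
```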
Future directions include automated prompt optimization, fine‑tuning smaller LLMs on span‑annotation data to close the gap between cost and performance, and extending the benchmark to multilingual settings.
Authors
- Alphaeus Dmonte
- Roland Oruche
- Tharindu Ranasinghe
- Marcos Zampieri
- Prasad Calyam
Paper Information
- arXiv ID: 2601.00736v1
- Categories: cs.CL, cs.AI
- Published: January 2, 2026