[Paper] Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
Source: arXiv - 2601.00736v1
Overview
The paper investigates how modern large language models (LLMs) perform when asked to locate subjective text spans—the exact words that convey sentiment, offensiveness, or factual claims. While most prior work has used smaller models (e.g., BERT) for classic span‑tagging tasks like NER, this study is one of the first to systematically evaluate LLMs on more nuanced, opinion‑based span identification tasks.
Key Contributions
- Comprehensive benchmark across three real‑world tasks: aspect‑based sentiment analysis, offensive language detection, and claim verification.
- Systematic comparison of several LLM prompting strategies—plain zero‑shot, instruction‑tuned prompts, in‑context learning (few‑shot examples), and chain‑of‑thought (CoT) reasoning.
- Empirical evidence that underlying textual relationships (e.g., sentiment cues, discourse markers) help LLMs pinpoint spans more accurately than baseline methods.
- Open‑source evaluation scripts and a reproducible leaderboard for future research on subjective span identification.
Methodology
- Datasets – The authors selected publicly available corpora for each task:
  - Sentiment: SemEval‑ABSA datasets with aspect terms and polarity spans.
  - Offensive: OLID (Offensive Language Identification Dataset) with annotated offensive spans.
  - Claim verification: FEVER‑S with evidence sentence spans.
- LLM families – Experiments used several state‑of‑the‑art models (e.g., GPT‑3.5, Claude‑2, LLaMA‑2), accessed via API or as open‑source checkpoints.
- Prompt designs – Four prompting regimes were tested (a minimal prompt sketch follows this list):
  - Zero‑shot: a single instruction asking the model to “highlight the span that expresses the sentiment/offense/claim.”
  - Instruction‑tuned: a more detailed prompt that defines the span‑identification task and provides formatting guidelines.
  - In‑context learning: 2–3 exemplars showing input text, the target span, and the expected output format.
  - Chain‑of‑thought: the model first explains why a particular fragment is relevant before outputting the span.
- Evaluation metrics – Span‑level precision, recall, and F1 were computed using exact‑match and partial‑overlap criteria (similar to the standard “token‑level” evaluation in NER).
- Baselines – BERT‑based token classifiers trained on the same data served as strong, task‑specific baselines.
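To make the four prompting regimes concrete, here is a minimal sketch of how such prompts could be assembled. The wording, helper names, and output format below are illustrative assumptions, not the paper's exact templates.

```python
# Illustrative prompt builders for the four regimes; the wording is an
# assumption for demonstration, not the paper's exact templates.

ZERO_SHOT = (
    "Highlight the span in the text that expresses the {target}.\n"
    "Text: {text}\nSpan:"
)

INSTRUCTION_TUNED = (
    "Task: identify the minimal contiguous span that expresses the {target}.\n"
    "Copy the span verbatim from the text and return it on a single line.\n"
    "Text: {text}\nSpan:"
)

def few_shot_prompt(examples, text, target):
    """In-context learning: prepend 2-3 (text, span) exemplars before the query."""
    demos = "\n\n".join(
        f"Text: {ex_text}\nSpan: {ex_span}" for ex_text, ex_span in examples
    )
    return (
        f"Identify the span that expresses the {target}.\n\n"
        f"{demos}\n\nText: {text}\nSpan:"
    )

def cot_prompt(text, target):
    """Chain-of-thought: ask for a brief rationale before the final span."""
    return (
        f"Explain briefly which fragment of the text conveys the {target}, "
        f"then output that fragment on a final line starting with 'Span:'.\n"
        f"Text: {text}"
    )

# Example: zero-shot prompt for an aspect-based sentiment input.
print(ZERO_SHOT.format(target="sentiment", text="The battery life is fantastic."))
```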
Results & Findings
| Task | Best LLM Prompt | F1 (Exact) | F1 (Partial) | BERT Baseline |
|---|---|---|---|---|
| Sentiment (ABSA) | CoT + In‑context | 78.4 | 85.1 | 71.2 |
| Offensive | Instruction‑tuned | 74.9 | 82.3 | 68.7 |
| Claim verification | In‑context (3‑shot) | 71.5 | 79.0 | 66.4 |
- Chain‑of‑thought reasoning consistently boosted performance on the more nuanced sentiment task, suggesting that prompting the model to “think aloud” helps it resolve ambiguous cues.
- In‑context examples were especially valuable for claim verification, where the model needed to understand the logical relationship between premise and evidence.
- Across all tasks, LLMs outperformed the BERT baselines despite not being fine‑tuned on the specific datasets, highlighting the power of large‑scale pre‑training combined with smart prompting.
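To clarify the distinction between the exact and partial F1 columns above, the sketch below shows one common way to compute span‑level scores under exact‑match and token‑overlap criteria. This is an assumed formulation for illustration; the paper's precise matching and aggregation rules may differ.

```python
def span_scores(pred_spans, gold_spans):
    """Exact-match and partial (token-overlap) F1 for one example.

    Spans are (start, end) token offsets, end exclusive. This is an assumed
    formulation; the paper's exact overlap criterion may differ.
    """
    def f1(p, r):
        return 2 * p * r / (p + r) if (p + r) else 0.0

    # Exact match: a predicted span counts only if it is identical to a gold span.
    exact_hits = len(set(pred_spans) & set(gold_spans))
    p_exact = exact_hits / len(pred_spans) if pred_spans else 0.0
    r_exact = exact_hits / len(gold_spans) if gold_spans else 0.0

    # Partial match: score token-level overlap between predicted and gold tokens.
    pred_tokens = {t for s, e in pred_spans for t in range(s, e)}
    gold_tokens = {t for s, e in gold_spans for t in range(s, e)}
    overlap = len(pred_tokens & gold_tokens)
    p_part = overlap / len(pred_tokens) if pred_tokens else 0.0
    r_part = overlap / len(gold_tokens) if gold_tokens else 0.0

    return {"exact_f1": f1(p_exact, r_exact), "partial_f1": f1(p_part, r_part)}

# Example: the prediction overlaps the gold span but does not match it exactly.
print(span_scores(pred_spans=[(3, 6)], gold_spans=[(4, 6)]))
# -> exact_f1 = 0.0, partial_f1 = 0.8
```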
Practical Implications
- Explainable AI: Developers can leverage LLMs to generate human‑readable justifications (the highlighted span) for sentiment or moderation decisions, improving transparency in user‑facing applications.
- Rapid prototyping: Because the best results come from prompting alone, teams can build functional span‑extraction pipelines without costly annotation or fine‑tuning cycles.
- Content moderation: The offensive‑language findings suggest that LLMs can pinpoint the exact offending phrase, enabling more precise automated edits or warnings (a masking sketch follows this list).
- Fact‑checking tools: Accurate evidence‑span extraction can feed downstream verification engines, reducing the manual effort required to locate supporting sentences in large corpora.
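As a small illustration of the content‑moderation use case, the sketch below masks an identified offensive span before a message is displayed. The redaction helper and the example span are hypothetical; in practice the span would come from whichever LLM prompting pipeline is used.

```python
def redact_span(text: str, span: str, mask_char: str = "*") -> str:
    """Replace an identified offensive span with a mask of equal length.

    Assumes `span` is the exact substring returned by the span-identification
    step; if the model paraphrased instead of copying, the lookup fails and
    the text is returned unchanged.
    """
    start = text.find(span)
    if start == -1:
        return text
    return text[:start] + mask_char * len(span) + text[start + len(span):]

# Hypothetical usage with a span produced by the extraction step.
print(redact_span("That was a truly idiotic take.", "idiotic"))
# -> "That was a truly ******* take."
```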
Limitations & Future Work
- Prompt sensitivity – Performance varies noticeably with prompt wording; the paper notes that systematic prompt‑search is still an open problem.
- API constraints – Some LLMs were accessed via commercial APIs, limiting reproducibility for researchers without paid access.
- Domain coverage – Experiments focus on English news/social‑media data; cross‑lingual or domain‑specific (e.g., medical) span identification remains untested.
- Scalability – While zero‑shot prompting is cheap, in‑context learning with multiple examples can increase latency and token costs, which may be prohibitive for high‑throughput services.
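To put the token‑cost point in perspective, here is a back‑of‑envelope comparison; the prompt and exemplar lengths are assumed numbers for illustration, not figures from the paper.

```python
# Rough cost comparison between zero-shot and 3-shot prompting.
# All token counts below are assumptions for illustration.
ZERO_SHOT_PROMPT_TOKENS = 60   # instruction + input text (assumed)
TOKENS_PER_EXEMPLAR = 150      # exemplar text + span + formatting (assumed)
NUM_EXEMPLARS = 3
REQUESTS_PER_DAY = 100_000

few_shot_tokens = ZERO_SHOT_PROMPT_TOKENS + NUM_EXEMPLARS * TOKENS_PER_EXEMPLAR
extra_per_day = (few_shot_tokens - ZERO_SHOT_PROMPT_TOKENS) * REQUESTS_PER_DAY
print(f"{few_shot_tokens} prompt tokens per request "
      f"(vs. {ZERO_SHOT_PROMPT_TOKENS} zero-shot); "
      f"{extra_per_day:,} extra input tokens per day")
```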
Future directions include automated prompt optimization, fine‑tuning smaller LLMs on span‑annotation data to close the gap between cost and performance, and extending the benchmark to multilingual settings.
Authors
- Alphaeus Dmonte
- Roland Oruche
- Tharindu Ranasinghe
- Marcos Zampieri
- Prasad Calyam
Paper Information
- arXiv ID: 2601.00736v1
- Categories: cs.CL, cs.AI
- Published: January 2, 2026