[Paper] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels
Source: arXiv - 2511.21038v1
Overview
This paper investigates whether large language models (LLMs) can re-learn the meaning of class labels when given a few examples that deliberately flip those labels. By treating in-context learning (ICL) as a prompt-driven classifier, the authors show that small open-source LMs (1–12 B parameters) are anchored to the semantics they acquired during pre-training and cannot "override" those semantics with a few-shot prompt.
Key Contributions
- Semantic‑anchor hypothesis: Proposes that ICL mainly projects inputs onto pre‑trained semantic directions rather than remapping label meanings.
- Three alignment metrics: Introduces truth alignment, prior alignment, and prompt alignment to dissect how a model’s predictions relate to ground‑truth, its zero‑shot bias, and the supplied demonstrations.
- Semantic Override Rate (SOR): Defines a new metric that measures how often a model correctly follows flipped label semantics.
- Empirical study: Evaluates eight classification tasks across eight open-source LLMs (1–12 B parameters) with both natural and inverted demonstrations.
- Negative result: Finds SOR = 0 for all tested models in the few‑shot regime, confirming that small LMs cannot learn anti‑semantic classifiers via prompting alone.
Methodology
- Prompt‑induced classification: Each task is framed as a text‑completion problem where the model receives a few demonstration examples followed by a test input.
- Natural vs. inverted demos (see the first sketch after this list):
  - Natural demos use the correct label mapping (e.g., "spam → 1").
  - Inverted demos systematically swap the label meanings (e.g., "spam → 0").
- Alignment decomposition:
  - Truth alignment – agreement with the ground-truth label.
  - Prior alignment – agreement with the model's zero-shot prediction (its built-in bias).
  - Prompt alignment – agreement with the label indicated by the demonstrations in the prompt.
- Semantic Override Rate (SOR): Calculated as the proportion of test instances where the model's prediction matches the flipped label semantics of the inverted prompt (see the second sketch after this list).
- Experiments: Conducted on eight benchmark classification datasets (sentiment, topic, intent, etc.) using eight open‑source LLMs ranging from 1 B to 12 B parameters, each evaluated with 1‑shot and 5‑shot prompts.
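To make the prompt setup concrete, here is a minimal sketch of how natural and inverted few-shot prompts could be constructed for a binary spam task. The template, the spam/ham label names, and the `build_prompt` helper are illustrative assumptions, not the paper's exact prompt format.

```python
# Hypothetical prompt construction for natural vs. inverted demonstrations.
# The template and the spam/ham label mapping are illustrative assumptions,
# not the paper's exact format.

NATURAL_MAP = {"spam": "1", "ham": "0"}    # correct label mapping
INVERTED_MAP = {"spam": "0", "ham": "1"}   # systematically swapped mapping


def build_prompt(demos, test_text, label_map):
    """Format k demonstrations plus one test input as a text-completion prompt."""
    blocks = [f"Text: {text}\nLabel: {label_map[gold]}" for text, gold in demos]
    blocks.append(f"Text: {test_text}\nLabel:")
    return "\n\n".join(blocks)


demos = [
    ("Win a free prize now!!!", "spam"),
    ("Are we still meeting at 3pm?", "ham"),
]

natural_prompt = build_prompt(demos, "Claim your reward today", NATURAL_MAP)
inverted_prompt = build_prompt(demos, "Claim your reward today", INVERTED_MAP)
print(inverted_prompt)
```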
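And here is a minimal sketch of the three alignment metrics and SOR, assuming per-example predictions are available as simple 0/1 lists for one task; the paper's exact definitions (e.g., any aggregation over tasks or shot counts) may differ. The per-example data is made up purely for illustration.

```python
# Hypothetical computation of truth/prior/prompt alignment and SOR for one
# binary task. The example data below is fabricated for illustration only.

def alignment(preds, refs):
    """Fraction of examples where the prediction matches a reference labeling."""
    return sum(p == r for p, r in zip(preds, refs)) / len(preds)


def flip(labels):
    """Swap 0/1 labels to obtain the inverted (anti-semantic) reference."""
    return [1 - y for y in labels]


gold = [1, 0, 1, 1, 0]           # ground-truth labels
zero_shot = [1, 0, 0, 1, 0]      # the model's zero-shot predictions (its prior)
few_shot_inv = [1, 0, 1, 1, 0]   # predictions under inverted demonstrations

truth_alignment = alignment(few_shot_inv, gold)
prior_alignment = alignment(few_shot_inv, zero_shot)
prompt_alignment = alignment(few_shot_inv, flip(gold))  # inverted prompt labels

# Semantic Override Rate, read here simply as prompt alignment under inverted
# demonstrations; the paper's definition may be stricter.
sor = prompt_alignment
print(truth_alignment, prior_alignment, prompt_alignment, sor)
```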
Results & Findings
| Model size | Accuracy gain (natural demos) | Prior alignment | Prompt alignment (inverted demos) | SOR |
|---|---|---|---|---|
| 1 B | +3–5 % over zero-shot | High (≈80 %) | Rises modestly, at the cost of accuracy | 0 % |
| 3–12 B | +4–9 % over zero-shot | High (≈85 %) | Rises only when accuracy collapses | 0 % |
- Natural demos improve overall accuracy while the model’s predictions remain strongly aligned with its pre‑trained prior; most correct answers are identical to zero‑shot outputs.
- Inverted demos never produce a coherent anti‑semantic classifier: the model can increase prompt alignment only by sacrificing truth alignment, resulting in a semantic override rate of zero across all sizes and tasks.
- The findings hold consistently across tasks, shot numbers, and model scales up to 12 B parameters.
Practical Implications
- Prompt engineering limits: For small to medium LLMs, you can’t rely on a few examples to completely flip a model’s notion of a label (e.g., redefining “positive” as “negative”). Prompt design should focus on clarifying the task rather than re‑labeling it.
- Zero‑shot bias awareness: Since ICL largely leans on the model’s existing priors, developers should inspect zero‑shot behavior first; a strong bias can dominate even with several demonstrations.
- Fine‑tuning vs. prompting: To truly change label semantics (e.g., custom taxonomies, domain‑specific categories), you’ll need lightweight fine‑tuning, adapters, or retrieval‑augmented methods rather than pure few‑shot prompting.
- Safety & alignment: The inability to override semantics with few‑shot prompts may be a double‑edged sword—good for preventing accidental label hijacking, but limiting for rapid customization in low‑resource settings.
Limitations & Future Work
- Model scale: Experiments stop at 12 B parameters; it remains open whether much larger LLMs (e.g., 70 B+) can achieve non‑zero SOR.
- Task diversity: Only classification tasks were examined; generative or multi‑label settings might exhibit different behavior.
- Prompt formats: The study uses a fixed demonstration template; richer prompting strategies (chain‑of‑thought, self‑consistency) were not explored.
- Future directions: Extending the analysis to instruction‑tuned models, probing the effect of retrieval‑augmented prompts, and investigating how modest parameter‑efficient fine‑tuning interacts with the semantic‑anchor phenomenon.
Authors
- Anantha Padmanaban Krishna Kumar
Paper Information
- arXiv ID: 2511.21038v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: November 26, 2025