[Paper] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Published: November 25, 2025 at 11:14 PM EST
3 min read
Source: arXiv - 2511.21038v1

Overview

This paper investigates whether large language models (LLMs) can re‑learn the meaning of class labels when given a few in‑context examples that deliberately flip those labels. Treating in‑context learning (ICL) as a prompt‑driven classifier, the authors show that small open‑source LMs (1–12 B parameters) remain anchored to the label semantics acquired during pre‑training and cannot override them with a few‑shot prompt.

Key Contributions

  • Semantic‑anchor hypothesis: Proposes that ICL mainly projects inputs onto pre‑trained semantic directions rather than remapping label meanings.
  • Three alignment metrics: Introduces truth alignment, prior alignment, and prompt alignment to dissect how a model’s predictions relate to ground‑truth, its zero‑shot bias, and the supplied demonstrations.
  • Semantic Override Rate (SOR): Defines a new metric that measures how often a model correctly follows flipped label semantics (a sketch of how these quantities can be computed follows this list).
  • Empirical study: Evaluates eight classification tasks across eight open‑source LLMs (1–12 B parameters) with both natural and inverted demonstrations.
  • Negative result: Finds SOR = 0 for all tested models in the few‑shot regime, confirming that small LMs cannot learn anti‑semantic classifiers via prompting alone.
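
The alignment metrics and SOR are per‑example agreement rates. The sketch below shows one plausible way to compute them from recorded predictions; it is not the authors' code, and the paper's exact SOR criterion may be stricter than the plain proportion shown here (it reports exact zeros).

```python
import numpy as np

def alignment_metrics(preds, true_labels, zero_shot_preds, prompt_labels):
    """Agreement rates between few-shot predictions and three references:
    ground truth (truth alignment), the model's own zero-shot predictions
    (prior alignment), and the labels implied by the prompt's mapping
    (prompt alignment; under inversion these are the flipped labels)."""
    preds = np.asarray(preds)
    truth = float(np.mean(preds == np.asarray(true_labels)))
    prior = float(np.mean(preds == np.asarray(zero_shot_preds)))
    prompt = float(np.mean(preds == np.asarray(prompt_labels)))
    return truth, prior, prompt

def semantic_override_rate(preds_under_inversion, flipped_labels):
    """Proportion of test items whose prediction follows the flipped label
    semantics supplied by inverted demonstrations. This is the plain-proportion
    reading of SOR; the paper may apply a stricter coherence criterion."""
    preds = np.asarray(preds_under_inversion)
    return float(np.mean(preds == np.asarray(flipped_labels)))
```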

Methodology

  1. Prompt‑induced classification: Each task is framed as a text‑completion problem where the model receives a few demonstration examples followed by a test input.
  2. Natural vs. inverted demos (a prompt‑construction sketch follows this list):
    • Natural demos use the correct label mapping (e.g., “spam → 1”).
    • Inverted demos systematically swap the label meanings (e.g., “spam → 0”).
  3. Alignment decomposition:
    • Truth alignment – agreement with the true label.
    • Prior alignment – agreement with the model’s zero‑shot prediction (its built‑in bias).
    • Prompt alignment – agreement with the label indicated by the prompt.
  4. Semantic Override Rate (SOR): Calculated as the proportion of test instances where the model’s prediction matches the flipped label semantics (i.e., follows the inverted prompt).
  5. Experiments: Conducted on eight benchmark classification datasets (sentiment, topic, intent, etc.) using eight open‑source LLMs ranging from 1 B to 12 B parameters, each evaluated with 1‑shot and 5‑shot prompts.
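
The inverted‑demonstration setup amounts to swapping the label tokens in the demonstrations while leaving the test input untouched. The following sketch illustrates the idea for a binary label space; the template, helper name, and spam example are illustrative assumptions, not the paper's exact format.

```python
def build_prompt(demos, test_text, invert=False):
    """Assemble a few-shot classification prompt from (text, label) demos.

    Labels are assumed binary (0/1). With invert=True the demonstration
    labels are swapped, so the prompt teaches the flipped mapping
    (e.g., "spam" examples shown with label 0 instead of 1).
    """
    flip = {0: 1, 1: 0}
    blocks = []
    for text, label in demos:
        shown = flip[label] if invert else label
        blocks.append(f"Input: {text}\nLabel: {shown}")
    blocks.append(f"Input: {test_text}\nLabel:")
    return "\n\n".join(blocks)

# 1-shot inverted prompt for a hypothetical spam task (true label: spam = 1)
demo = [("Win a FREE prize now!!!", 1)]
print(build_prompt(demo, "Meeting moved to 3pm", invert=True))
```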

Results & Findings

| Model size | Accuracy boost (natural demos) | Prior alignment | Prompt alignment (inverted demos) | SOR |
|---|---|---|---|---|
| 1 B | +3–5 % over zero‑shot | High (≈80 %) | Increases modestly, but at the cost of accuracy | 0 % |
| 3 B–12 B | +4–9 % over zero‑shot | High (≈85 %) | Increases only when accuracy collapses | 0 % |
  • Natural demos improve overall accuracy while the model’s predictions remain strongly aligned with its pre‑trained prior; most correct answers are identical to zero‑shot outputs.
  • Inverted demos never produce a coherent anti‑semantic classifier: the model can increase prompt alignment only by sacrificing truth alignment, resulting in a semantic override rate of zero across all sizes and tasks.
  • The findings hold consistently across tasks, shot numbers, and model scales up to 12 B parameters.

Practical Implications

  • Prompt engineering limits: For small to medium LLMs, you can’t rely on a few examples to completely flip a model’s notion of a label (e.g., redefining “positive” as “negative”). Prompt design should focus on clarifying the task rather than re‑labeling it.
  • Zero‑shot bias awareness: Since ICL largely leans on the model’s existing priors, developers should inspect zero‑shot behavior first; a strong bias can dominate even with several demonstrations (a short bias‑inspection sketch follows this list).
  • Fine‑tuning vs. prompting: To truly change label semantics (e.g., custom taxonomies, domain‑specific categories), you’ll need lightweight fine‑tuning, adapters, or retrieval‑augmented methods rather than pure few‑shot prompting.
  • Safety & alignment: The inability to override semantics with few‑shot prompts may be a double‑edged sword—good for preventing accidental label hijacking, but limiting for rapid customization in low‑resource settings.

Limitations & Future Work

  • Model scale: Experiments stop at 12 B parameters; it remains open whether much larger LLMs (e.g., 70 B+) can achieve non‑zero SOR.
  • Task diversity: Only classification tasks were examined; generative or multi‑label settings might exhibit different behavior.
  • Prompt formats: The study uses a fixed demonstration template; richer prompting strategies (chain‑of‑thought, self‑consistency) were not explored.
  • Future directions: Extending the analysis to instruction‑tuned models, probing the effect of retrieval‑augmented prompts, and investigating how modest parameter‑efficient fine‑tuning interacts with the semantic‑anchor phenomenon.

Authors

  • Anantha Padmanaban Krishna Kumar

Paper Information

  • arXiv ID: 2511.21038v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: November 26, 2025