[Paper] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels
Source: arXiv - 2511.21038v1
Overview
This paper investigates whether large language models (LLMs) can re-learn the meaning of class labels when given a few examples that deliberately flip those labels. By treating in-context learning (ICL) as a prompt-driven classifier, the authors show that small open-source LMs (1–12 B parameters) are anchored to the semantics they acquired during pre-training and cannot "override" those semantics with a few-shot prompt.
Key Contributions
- Semantic‑anchor hypothesis: Proposes that ICL mainly projects inputs onto pre‑trained semantic directions rather than remapping label meanings.
- Three alignment metrics: Introduces truth alignment, prior alignment, and prompt alignment to dissect how a model’s predictions relate to ground‑truth, its zero‑shot bias, and the supplied demonstrations.
- Semantic Override Rate (SOR): Defines a new metric that measures how often a model correctly follows flipped label semantics.
- Empirical study: Evaluates eight classification tasks across eight open-source LLMs (1–12 B parameters) with both natural and inverted demonstrations.
- Negative result: Finds SOR = 0 for all tested models in the few‑shot regime, confirming that small LMs cannot learn anti‑semantic classifiers via prompting alone.
Methodology
- Prompt‑induced classification: Each task is framed as a text‑completion problem where the model receives a few demonstration examples followed by a test input.
- Natural vs. inverted demos (see the first sketch after this list):
  - Natural demos use the correct label mapping (e.g., "spam → 1").
  - Inverted demos systematically swap the label meanings (e.g., "spam → 0").
- Alignment decomposition:
  - Truth alignment – agreement with the ground-truth label.
  - Prior alignment – agreement with the model's zero-shot prediction (its built-in bias).
  - Prompt alignment – agreement with the label indicated by the demonstrations in the prompt.
- Semantic Override Rate (SOR): Calculated as the proportion of test instances where the model's prediction matches the flipped label semantics of the inverted prompt (see the second sketch after this list).
- Experiments: Conducted on eight benchmark classification datasets (sentiment, topic, intent, etc.) using eight open‑source LLMs ranging from 1 B to 12 B parameters, each evaluated with 1‑shot and 5‑shot prompts.
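To make the prompt setup concrete, here is a minimal sketch of how natural and inverted few-shot prompts could be constructed for a binary spam task. The template, the spam/ham label names, and the `build_prompt` helper are illustrative assumptions, not the paper's exact prompt format.

```python
# Hypothetical prompt construction for natural vs. inverted demonstrations.
# The template and the spam/ham label mapping are illustrative assumptions,
# not the paper's exact format.

NATURAL_MAP = {"spam": "1", "ham": "0"}    # correct label mapping
INVERTED_MAP = {"spam": "0", "ham": "1"}   # systematically swapped mapping


def build_prompt(demos, test_text, label_map):
    """Format k demonstrations plus one test input as a text-completion prompt."""
    blocks = [f"Text: {text}\nLabel: {label_map[gold]}" for text, gold in demos]
    blocks.append(f"Text: {test_text}\nLabel:")
    return "\n\n".join(blocks)


demos = [
    ("Win a free prize now!!!", "spam"),
    ("Are we still meeting at 3pm?", "ham"),
]

natural_prompt = build_prompt(demos, "Claim your reward today", NATURAL_MAP)
inverted_prompt = build_prompt(demos, "Claim your reward today", INVERTED_MAP)
print(inverted_prompt)
```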
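And here is a minimal sketch of the three alignment metrics and SOR, assuming per-example predictions are available as simple 0/1 lists for one task; the paper's exact definitions (e.g., any aggregation over tasks or shot counts) may differ. The per-example data is made up purely for illustration.

```python
# Hypothetical computation of truth/prior/prompt alignment and SOR for one
# binary task. The example data below is fabricated for illustration only.

def alignment(preds, refs):
    """Fraction of examples where the prediction matches a reference labeling."""
    return sum(p == r for p, r in zip(preds, refs)) / len(preds)


def flip(labels):
    """Swap 0/1 labels to obtain the inverted (anti-semantic) reference."""
    return [1 - y for y in labels]


gold = [1, 0, 1, 1, 0]           # ground-truth labels
zero_shot = [1, 0, 0, 1, 0]      # the model's zero-shot predictions (its prior)
few_shot_inv = [1, 0, 1, 1, 0]   # predictions under inverted demonstrations

truth_alignment = alignment(few_shot_inv, gold)
prior_alignment = alignment(few_shot_inv, zero_shot)
prompt_alignment = alignment(few_shot_inv, flip(gold))  # inverted prompt labels

# Semantic Override Rate, read here simply as prompt alignment under inverted
# demonstrations; the paper's definition may be stricter.
sor = prompt_alignment
print(truth_alignment, prior_alignment, prompt_alignment, sor)
```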
Results & Findings
| Model size | Accuracy gain (natural demos) | Prior alignment | Prompt alignment (inverted demos) | SOR |
|---|---|---|---|---|
| 1 B | +3–5 % over zero-shot | High (≈80 %) | Rises modestly, at the cost of accuracy | 0 % |
| 3–12 B | +4–9 % over zero-shot | High (≈85 %) | Rises only when accuracy collapses | 0 % |
- Natural demos improve overall accuracy while the model’s predictions remain strongly aligned with its pre‑trained prior; most correct answers are identical to zero‑shot outputs.
- Inverted demos never produce a coherent anti‑semantic classifier: the model can increase prompt alignment only by sacrificing truth alignment, resulting in a semantic override rate of zero across all sizes and tasks.
- The findings hold consistently across tasks, shot numbers, and model scales up to 12 B parameters.
Practical Implications
- Prompt engineering limits: For small to medium LLMs, you can’t rely on a few examples to completely flip a model’s notion of a label (e.g., redefining “positive” as “negative”). Prompt design should focus on clarifying the task rather than re‑labeling it.
- Zero‑shot bias awareness: Since ICL largely leans on the model’s existing priors, developers should inspect zero‑shot behavior first; a strong bias can dominate even with several demonstrations.
- Fine‑tuning vs. prompting: To truly change label semantics (e.g., custom taxonomies, domain‑specific categories), you’ll need lightweight fine‑tuning, adapters, or retrieval‑augmented methods rather than pure few‑shot prompting.
- Safety & alignment: The inability to override semantics with few‑shot prompts may be a double‑edged sword—good for preventing accidental label hijacking, but limiting for rapid customization in low‑resource settings.
Limitations & Future Work
- Model scale: Experiments stop at 12 B parameters; it remains open whether much larger LLMs (e.g., 70 B+) can achieve non‑zero SOR.
- Task diversity: Only classification tasks were examined; generative or multi‑label settings might exhibit different behavior.
- Prompt formats: The study uses a fixed demonstration template; richer prompting strategies (chain‑of‑thought, self‑consistency) were not explored.
- Future directions: Extending the analysis to instruction‑tuned models, probing the effect of retrieval‑augmented prompts, and investigating how modest parameter‑efficient fine‑tuning interacts with the semantic‑anchor phenomenon.
Authors
- Anantha Padmanaban Krishna Kumar
Paper Information
- arXiv ID: 2511.21038v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: November 26, 2025