[Paper] Layer-wise Positional Bias in Short-Context Language Modeling

Published: January 7, 2026 at 12:04 PM EST
4 min read

Source: arXiv - 2601.04098v1

Overview

The paper “Layer-wise Positional Bias in Short-Context Language Modeling” examines how modern large language models (LLMs) distribute importance across token positions as information flows through their layers. By probing internal dynamics in short‑context settings, the authors reveal systematic “recency” and “primacy” biases that persist regardless of what the words actually mean. These insights matter for anyone building or fine‑tuning LLM‑based products.

Key Contributions

  • Attribution‑based analysis framework: Introduces a layer‑conductance method combined with a sliding‑window probe to measure the importance each layer assigns to every input position.
  • Architecture‑specific positional profiles: Shows that the shape of the bias (how much weight is given to recent vs. early tokens) is stable across inputs and differs between model families (e.g., GPT‑style vs. BERT‑style).
  • Depth‑dependent bias trends: Finds a strong recency bias that grows with layer depth and a subtle primacy bias that diminishes as we move deeper.
  • Word‑type differentiation: Early layers prioritize content words (nouns, verbs, adjectives) over function words (articles, prepositions) across all positions; this distinction fades in later layers.
  • Robustness to lexical scrambling: Positional importance profiles remain unchanged even when the token order is randomly shuffled, confirming that the bias is truly positional rather than semantic.

Methodology

  1. Sliding‑window probing: For a given short context (e.g., 32 tokens), the authors mask all but a moving window of one token and record the model’s output probability.
  2. Layer conductance: Using Integrated Gradients (see the formula after this list), they compute how much each layer contributes to the change in output as the window slides, yielding a per‑layer, per‑position importance score.
  3. Aggregation: Scores are averaged over many sentences and random seeds to produce stable “positional importance profiles” for each layer.
  4. Control experiments: They repeat the analysis on scrambled sentences and on different model architectures to isolate positional effects from lexical semantics.
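
For reference, the layer‑conductance scores in step 2 build on the standard Integrated Gradients attribution: for an input x, a baseline x′, and a scalar model output F, each input dimension i receives the score below, and conductance then aggregates these attributions as they flow through the units of a given layer. (This is the generic IG definition; the notation is not taken from the paper.)

$$
\mathrm{IG}_i(x) \;=\; (x_i - x_i')\int_{0}^{1} \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha
$$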

The pipeline is deliberately lightweight, with no retraining or heavy probing heads, making it easy to replicate on any transformer‑based LM; a minimal sketch of the probing step is shown below.
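
Here is a minimal sketch of steps 1 and 3, assuming GPT‑2 from Hugging Face as the probed model, the tokenizer's EOS id as a stand‑in mask token, and a hypothetical helper name (positional_probe); the paper's exact masking scheme, window size, and scoring may differ, and the per‑layer conductance of step 2 (computable with an attribution library such as Captum) is omitted for brevity.

```python
# Minimal sliding-window probe (illustrative sketch, not the authors' code).
# Assumptions: GPT-2 as the model, EOS as the "mask" token, and importance
# measured as the probability assigned to the true final token when only
# one context position is left visible.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")


def positional_probe(ids: torch.Tensor) -> torch.Tensor:
    """Importance score for each context position when predicting the last token."""
    context, target = ids[:-1], ids[-1].item()
    mask_id = tok.eos_token_id

    def prob_of_target(ctx: torch.Tensor) -> float:
        with torch.no_grad():
            logits = model(ctx.unsqueeze(0)).logits[0, -1]
        return torch.softmax(logits, dim=-1)[target].item()

    scores = []
    for i in range(len(context)):
        window = torch.full_like(context, mask_id)  # hide every position ...
        window[i] = context[i]                      # ... except position i
        scores.append(prob_of_target(window))
    return torch.tensor(scores)


ids = tok("The quick brown fox jumps over the lazy", return_tensors="pt").input_ids[0]
print(positional_probe(ids))  # one score per context position
```

Averaging such profiles over many sentences corresponds to the aggregation in step 3; the per‑layer breakdown comes from the conductance step that this sketch leaves out.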

Results & Findings

| Observation | What the data shows | Interpretation |
| --- | --- | --- |
| Recency bias ↑ with depth | Upper layers assign >60 % of importance to the last 5 tokens, even in a 32‑token window. | Deeper layers treat the most recent context as the primary signal for next‑token prediction. |
| Primacy bias ↓ with depth | Lower layers give a modest boost (~10 % extra weight) to the first few tokens; this advantage disappears after ~6‑8 layers. | Early processing retains a “memory” of the start of the sequence, but it is overwritten as representations become more abstract. |
| Content vs. function word weighting | In layers 1‑4, content words receive ~1.5× higher conductance than function words across all positions; layers 9‑12 show no distinction. | Initial layers act as lexical filters, while later layers focus on positional patterns rather than word class. |
| Stability across inputs & scrambling | Positional profiles have a Pearson correlation >0.9 between original and shuffled sentences. | The bias is a property of the model architecture, not of the specific sentence meaning (see the sketch below the table). |
| Architecture differences | GPT‑style (decoder‑only) models exhibit a steeper recency curve than encoder‑only BERT‑style models. | Design choices (causal masking vs. bidirectional attention) shape how positional information is propagated. |
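
The scrambling control can be reproduced on top of the positional_probe sketch from the Methodology section; the sentence, the random permutation, and the use of SciPy's pearsonr are illustrative choices, not the paper's setup.

```python
# Illustrative stability check (continues from the positional_probe sketch):
# permute the context tokens, recompute the positional profile, and correlate
# it with the original profile. The paper reports Pearson r > 0.9 here.
import torch
from scipy.stats import pearsonr

ids = tok("The quick brown fox jumps over the lazy", return_tensors="pt").input_ids[0]
context, target = ids[:-1], ids[-1:]

perm = torch.randperm(len(context))             # random token order
scrambled = torch.cat([context[perm], target])  # same length, shuffled context

original_profile = positional_probe(ids)
scrambled_profile = positional_probe(scrambled)

r, _ = pearsonr(original_profile.numpy(), scrambled_profile.numpy())
print(f"Pearson r between original and scrambled profiles: {r:.2f}")
```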

Practical Implications

  • Prompt engineering: Knowing that deeper layers heavily favor recent tokens suggests that placing critical instructions or context at the end of a prompt can improve model compliance, especially for decoder‑only LMs (see the sketch after this list).
  • Fine‑tuning strategies: When adapting a model to tasks that require long‑range dependencies (e.g., document summarization), consider adding auxiliary loss terms or adapters that explicitly boost primacy signals in higher layers.
  • Model debugging: Unexpected output quirks (e.g., “forgetting” early context) can now be traced to the natural attenuation of primacy bias, guiding developers to inspect or re‑weight early‑layer activations.
  • Architecture selection: For applications where the beginning of a sequence carries essential metadata (e.g., API keys, user IDs), encoder‑only or hybrid models may retain that information better than pure causal decoders.
  • Efficiency optimizations: Since later layers contribute little beyond recent tokens, one could truncate the context window for high‑depth inference without a large accuracy hit, saving compute in latency‑sensitive services.
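
To make the prompt‑engineering point concrete, here is a hypothetical layout helper; the section names and the assemble_prompt function are invented for illustration and are not from the paper.

```python
# Hypothetical prompt layout: put the instruction the model must follow last,
# where deeper layers of a decoder-only LM concentrate their importance.
def assemble_prompt(background: str, examples: str, instruction: str) -> str:
    """Order prompt sections so the critical instruction appears at the end."""
    return "\n\n".join([background, examples, instruction])


prompt = assemble_prompt(
    background="You are reviewing a customer support transcript.",
    examples="Example: 'I cannot log in' -> category: authentication",
    instruction="Classify the final message into exactly one category.",
)
print(prompt)
```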

Limitations & Future Work

  • Short‑context focus: Experiments are limited to windows ≤ 64 tokens; it remains unclear how the identified biases scale to truly long‑context models (e.g., 4k‑token LLaMA).
  • Single‑task evaluation: The analysis centers on next‑token prediction; other downstream tasks (e.g., classification, generation with beam search) might exhibit different bias dynamics.
  • Model families: Only a handful of popular transformer variants were examined; newer architectures (e.g., Retrieval‑augmented or Mixture‑of‑Experts models) could behave differently.
  • Causal attribution: Integrated Gradients provide an approximation of layer importance; alternative attribution methods might yield finer‑grained insights.

Future work could extend the framework to multi‑modal models, explore bias mitigation techniques (e.g., positional regularization), and investigate how training objectives (masked vs. causal) shape the evolution of positional bias across depth.

Authors

  • Maryam Rahimi
  • Mahdi Nouri
  • Yadollah Yaghoobzadeh

Paper Information

  • arXiv ID: 2601.04098v1
  • Categories: cs.CL, cs.AI
  • Published: January 7, 2026