[Paper] Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer

Published: May 6, 2026 at 01:42 PM EDT
Source: arXiv - 2605.05176v1

Overview

A new paper examines why large language models (LLMs) can learn on the fly from examples that appear in the prompt, a phenomenon called in‑context learning (ICL). Most prior work explained ICL for simple linear tasks; the authors extend the theory to nonlinear regression and show how a transformer’s attention heads can act as feature generators that produce classical bases such as polynomials or splines. The result is a concrete, mathematically backed picture of how LLMs can fit complex curves without ever updating their weights.

Key Contributions

  • Explicit construction of transformer attention as a feature extractor for nonlinear bases (polynomials, splines, etc.).
  • Generalization‑error analysis for end‑to‑end in‑context nonlinear regression, yielding finite‑sample bounds that depend on prompt length and the size of the pre‑training dataset.
  • Unified framework that bridges the gap between classic nonparametric regression theory and modern transformer architectures.
  • Empirical validation on synthetic regression benchmarks that confirm the theoretical predictions.

Methodology

  1. Feature‑by‑attention design – The authors design attention patterns that compute classic basis functions (e.g., $x^k$ for polynomials) directly from the token embeddings. By stacking a few such heads, the transformer builds a rich nonlinear feature space.
  2. In‑context regression pipeline – Given a prompt containing $(x_i, y_i)$ pairs, the model first maps each $x_i$ into the constructed feature vector via attention, then performs a simple linear read‑out (a final linear layer) to predict the target for a new query $x_{\text{new}}$.
  3. Theoretical analysis – Using tools from statistical learning theory (Rademacher complexity, covering numbers), they bound the expected squared error of the predictor as a function of:
    • $n$ – number of examples in the prompt (context length)
    • $m$ – size of the pre‑training corpus that the transformer was exposed to
    • The smoothness/complexity of the target function (captured by the chosen basis).
  4. Synthetic experiments – They generate data from known nonlinear functions (e.g., cubic polynomials, spline‑generated curves) and compare the transformer’s in‑context predictions against the theoretical error curves.
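
The pipeline in step 2 can be illustrated with a minimal NumPy sketch. It is not the paper's construction: the attention‑built featurizer is replaced by an explicit monomial basis, and the read‑out is fit by ordinary least squares on the prompt's examples. The names `poly_features`, `in_context_predict`, and `target`, as well as the cubic target and noise level, are hypothetical choices made for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalars to the monomial basis [1, x, ..., x^degree].

    Stand-in for the attention-built features (an assumption of this
    sketch, not the paper's attention construction).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.vander(x, N=degree + 1, increasing=True)

def in_context_predict(x_ctx, y_ctx, x_new, degree=3):
    """Fit a linear read-out on the prompt's features, then predict at x_new."""
    phi_ctx = poly_features(x_ctx, degree)
    # Least-squares read-out computed from the context pairs only.
    coef, *_ = np.linalg.lstsq(phi_ctx, np.asarray(y_ctx, dtype=float), rcond=None)
    return poly_features(x_new, degree) @ coef

# Hypothetical cubic target with a little observation noise.
def target(x):
    return 0.5 * x**3 - x + 0.2

rng = np.random.default_rng(0)
x_ctx = rng.uniform(-1.0, 1.0, size=32)               # prompt inputs
y_ctx = target(x_ctx) + 0.05 * rng.normal(size=32)    # prompt targets
x_new = np.array([-0.5, 0.0, 0.5])                    # query points

pred = in_context_predict(x_ctx, y_ctx, x_new)
print(np.c_[x_new, pred, target(x_new)])  # columns: query, prediction, truth
```

No parameters persist between prompts here: everything is recomputed from the context pairs, which mirrors the "no weight updates" character of in‑context learning.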

Results & Findings

  • Error scales as $O(1/n)$ in the prompt length $n$ for well‑specified bases, matching the classical rate for regression on a fixed, correctly specified basis.
  • Pre‑training size matters: larger $m$ reduces the constant factor in the bound, confirming that a richer pre‑training corpus improves the quality of the learned attention‑based features.
  • Feature richness vs. prompt length trade‑off: using higher‑degree polynomial bases yields lower bias but requires longer prompts to keep variance under control.
  • Empirical curves line up with theory: on synthetic tasks, the observed mean‑squared error follows the predicted decay, validating the analytical framework.
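
The qualitative trends above (error shrinking with prompt length, and richer bases needing longer prompts) can be reproduced in a small simulation. The sketch below uses ordinary least squares on a monomial basis as a stand‑in for the transformer, so it illustrates the classical behavior the bounds are compared against rather than the paper's experiments. The function `excess_mse`, the cubic coefficients, and the noise level are assumptions made for this example.

```python
import numpy as np

def excess_mse(n, degree, noise=0.1, trials=200, seed=0):
    """Average squared error to a fixed cubic target over random prompts.

    Each trial draws n context points, fits a degree-`degree` polynomial
    by least squares, and scores it on a dense grid in [-1, 1].
    """
    rng = np.random.default_rng(seed)
    true_coef = np.array([0.2, -1.0, 0.0, 0.5])  # hypothetical cubic target
    x_test = np.linspace(-1.0, 1.0, 101)
    f_test = np.vander(x_test, 4, increasing=True) @ true_coef
    errors = []
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=n)
        y = np.vander(x, 4, increasing=True) @ true_coef + noise * rng.normal(size=n)
        phi = np.vander(x, degree + 1, increasing=True)
        coef, *_ = np.linalg.lstsq(phi, y, rcond=None)
        pred = np.vander(x_test, degree + 1, increasing=True) @ coef
        errors.append(np.mean((pred - f_test) ** 2))
    return float(np.mean(errors))

# Well-specified basis (degree 3): error should fall as the prompt grows.
mse_by_n = {n: excess_mse(n, degree=3) for n in (16, 64, 256)}

# Short prompt (n = 12): compare an under-, well-, and over-sized basis.
mse_by_degree = {d: excess_mse(12, degree=d) for d in (1, 3, 9)}

print(mse_by_n)
print(mse_by_degree)
```

Multiplying each entry of `mse_by_n` by its `n` gives a rough check of the $1/n$ trend, and comparing degrees at `n = 12` shows how an oversized basis inflates variance when the prompt is short.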

Practical Implications

  • Prompt engineering becomes principled – Knowing that attention can synthesize polynomial or spline features suggests that structuring prompts to expose the right range of input values (e.g., covering the domain uniformly) will improve ICL performance.
  • Lightweight fine‑tuning alternatives – For regression‑type tasks (e.g., time‑series forecasting, parameter estimation), developers can rely on in‑context learning instead of costly gradient‑based fine‑tuning, provided the prompt is long enough.
  • Design of custom transformers – Model architects can deliberately allocate attention heads to compute specific basis functions, yielding “feature‑aware” LLMs that are more sample‑efficient for scientific or engineering domains.
  • Interpretability – Viewing attention as a featurizer opens up new debugging tools: by inspecting the attention weights, one can infer which basis functions the model is emphasizing for a given prompt.
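
The advice about covering the input domain can be made concrete with a standard least‑squares fact: for a linear read‑out on features $\phi(x)$, the prediction variance at a query $x$ is proportional to $\phi(x)^\top (\Phi^\top \Phi)^{-1} \phi(x)$, where $\Phi$ stacks the context features. The sketch below evaluates this factor for two hypothetical prompts, one spread over $[-1, 1]$ and one clustered near zero. It assumes a cubic monomial basis and an ordinary least‑squares read‑out, and it does not model a transformer; `variance_factor` is a name introduced here.

```python
import numpy as np

def variance_factor(x_ctx, x_query, degree=3):
    """Return phi(x)^T (Phi^T Phi)^{-1} phi(x) for each query point.

    Larger values mean a noisier least-squares prediction at that query.
    """
    phi = np.vander(np.asarray(x_ctx, dtype=float), degree + 1, increasing=True)
    q = np.vander(np.atleast_1d(np.asarray(x_query, dtype=float)),
                  degree + 1, increasing=True)
    gram = phi.T @ phi
    sol = np.linalg.solve(gram, q.T)      # shape: (degree + 1, num_queries)
    return np.sum(q * sol.T, axis=1)

uniform_ctx = np.linspace(-1.0, 1.0, 24)    # prompt covers the whole domain
clustered_ctx = np.linspace(-0.3, 0.3, 24)  # prompt covers only the middle
query = 0.9                                 # query near the domain edge

vf_uniform = variance_factor(uniform_ctx, query)[0]
vf_clustered = variance_factor(clustered_ctx, query)[0]
print(vf_uniform, vf_clustered)
```

With the same number of examples, the clustered prompt forces the read‑out to extrapolate to the query, which is why its variance factor is far larger.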

Limitations & Future Work

  • Synthetic focus – Experiments are limited to controlled regression datasets; real‑world noisy data may introduce additional challenges (e.g., outliers, heteroscedasticity).
  • Fixed basis families – The construction assumes the analyst knows a suitable basis (polynomial, spline). Extending the theory to learn the basis adaptively from data remains open.
  • Scalability of context length – The error bounds improve with longer prompts, but current API limits (e.g., token windows) restrict how many examples can be fed in practice.
  • Beyond regression – The paper hints at classification or structured prediction tasks, but a formal treatment for those settings is left for future research.

Bottom line: By showing how transformers can turn attention into a generator of nonlinear features, this work gives developers a concrete lens for understanding and using in‑context learning on nonlinear problems, bridging the gap between theory and the day‑to‑day practice of building AI‑powered applications.

Authors

  • Alexander Hsu
  • Zhaiming Shen
  • Wenjing Liao
  • Rongjie Lai

Paper Information

  • arXiv ID: 2605.05176v1
  • Categories: cs.LG, math.NA
  • Published: May 6, 2026