[Paper] Tight Sample Complexity of Transformers

Published: (June 8, 2026 at 12:56 PM EDT)
1 min read
Source: arXiv

Source: arXiv - 2606.09731v1

Overview

We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $Ω(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $Ω\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.

Key Contributions

This paper presents research in the following areas:

  • cs.LG

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.LG.

Authors

  • Chenxiao Yang
  • Nathan Srebro
  • Zhiyuan Li

Paper Information

  • arXiv ID: 2606.09731v1
  • Categories: cs.LG
  • Published: June 8, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »