[Paper] Tight Sample Complexity of Transformers

Published: June 8, 2026 at 12:56 PM EDT

Source: arXiv - 2606.09731v1

Overview

We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log(T W))$ and a nearly matching lower bound of $\Omega(L W \log(T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing that teacher forcing (i.e., selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O(L W \log((T + T^{\prime}) W))$ and that any learning rule that uses chain-of-thought data requires at least $\Omega(L W \log((T + T^{\prime}) W / L))$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.
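To get a feel for how close the upper and lower bounds are, the growth terms inside the $O(\cdot)$ and $\Omega(\cdot)$ can be evaluated numerically. The following is only a rough sketch: the abstract does not state the hidden constants, so they are set to 1 here, the natural logarithm is an arbitrary choice (a different base only changes the constant), and all function names are hypothetical.

```python
import math

# Assumption: the constants hidden by O(.) and Omega(.) are set to 1.
# These helpers give growth terms only, not actual sample counts.

def vc_upper_term(L, W, T):
    """Growth term L * W * log(T * W) of the VC dimension upper bound."""
    return L * W * math.log(T * W)

def vc_lower_term(L, W, T):
    """Growth term L * W * log(T * W / L) of the VC dimension lower bound."""
    return L * W * math.log(T * W / L)

def cot_upper_term(L, W, T, T_prime):
    """Growth term L * W * log((T + T') * W) for teacher forcing."""
    return L * W * math.log((T + T_prime) * W)

def cot_lower_term(L, W, T, T_prime):
    """Growth term L * W * log((T + T') * W / L) for any chain-of-thought learner."""
    return L * W * math.log((T + T_prime) * W / L)

def gap_ratio(L, W, T):
    """Ratio of upper to lower growth terms: log(T * W) / log(T * W / L)."""
    return vc_upper_term(L, W, T) / vc_lower_term(L, W, T)

if __name__ == "__main__":
    # Hypothetical model sizes, chosen only for illustration.
    L, W, T, T_prime = 12, 10**8, 1024, 4096
    print("VC upper term:", vc_upper_term(L, W, T))
    print("VC lower term:", vc_lower_term(L, W, T))
    print("upper / lower:", gap_ratio(L, W, T))
    print("CoT upper term:", cot_upper_term(L, W, T, T_prime))
    print("CoT lower term:", cot_lower_term(L, W, T, T_prime))
```

With constants ignored, the two terms differ only by the $L$ inside the logarithm: their difference is $L W \log L$, so the ratio stays close to 1 whenever $\log L$ is small relative to $\log(T W)$.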

Key Contributions

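The paper (arXiv category cs.LG) establishes four bounds for depth-$L$ Transformers with $W$ parameters:

VC dimension upper bound: $O(L W \log(T W))$ for input length $T$
VC dimension lower bound: $\Omega(L W \log(T W / L))$
Chain-of-thought upper bound: teacher forcing learns with sample complexity $O(L W \log((T + T^{\prime}) W))$
Chain-of-thought lower bound: any learning rule that uses chain-of-thought data requires $\Omega(L W \log((T + T^{\prime}) W / L))$ examples, where $T^{\prime}$ is the number of autoregressive steps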

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

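The bounds tie the number of training examples needed to a Transformer's depth, parameter count, and sequence length, with or without chain-of-thought supervision.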

Authors

Chenxiao Yang
Nathan Srebro
Zhiyuan Li

Paper Information

arXiv ID: 2606.09731v1
Categories: cs.LG
Published: June 8, 2026
