[Paper] Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
Source: arXiv - 2602.08984v1
Overview
The paper introduces Next Concept Prediction (NCP), a pre‑training objective layered on top of the classic next‑token prediction (NTP) used by most large language models. Instead of predicting only the next token, the model also predicts a discrete “concept” that can span several tokens (e.g., a phrase, an entity, or a recurring pattern). By forcing the model to learn these higher‑level units, the authors show that language models become more expressive and achieve consistent gains across a wide range of downstream tasks.
Key Contributions
- Next Concept Prediction (NCP): a novel pre‑training task that predicts multi‑token concepts in addition to the usual next‑token objective.
- ConceptLM architecture: integrates vector‑quantized latent representations to build a compact “concept vocabulary” and uses the predicted concept to steer token generation.
- Scalable training: experiments from 70 M up to 1.5 B parameters, trained on up to 300 B tokens (including the Pythia and GPT‑2 data pipelines).
- Empirical gains: consistent improvements on 13 benchmark datasets (e.g., language understanding, reasoning, and generation tasks).
- Continual pre‑training proof‑point: applying NCP on top of an already‑trained 8 B‑parameter LLaMA model yields additional performance boosts, demonstrating compatibility with existing models.
Methodology
- Quantizing hidden states – The model’s continuous hidden vectors are passed through a vector‑quantization (VQ) layer, which maps each vector to the nearest entry in a learned codebook. Each codebook entry becomes a concept token.
- Building a concept vocabulary – By clustering similar hidden states across the training corpus, the VQ codebook captures recurring multi‑token patterns (e.g., “New York City”, “machine learning”, common idioms).
- Dual‑objective training – During pre‑training, the model simultaneously:
- Predicts the next token (standard NTP).
- Predicts the next concept token from the codebook (NCP).
The loss from both heads is summed, encouraging the network to learn both fine‑grained lexical knowledge and coarse‑grained semantic chunks.
- Guided token generation – At inference time, the predicted concept token is fed back into the decoder, providing a high‑level “hint” that conditions the subsequent token predictions.
The overall pipeline is simple enough to plug into existing transformer codebases: replace or augment the language modeling head with a VQ layer and an extra classification head for concepts.
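The core of that pipeline can be sketched in a few lines. The following is a minimal NumPy illustration of nearest‑codebook quantization and the summed dual loss, not the authors' implementation; the codebook size, dimensions, and equal loss weighting here are assumptions for demonstration only.

```python
# Minimal sketch of NCP's core pieces: (1) quantize hidden states against a
# concept codebook, (2) sum next-token and next-concept cross-entropy losses.
# All sizes and the loss weighting are illustrative assumptions.
import numpy as np

def quantize(hidden, codebook):
    """Map each hidden vector to the index of its nearest codebook entry.

    hidden:   (seq_len, d) continuous hidden states
    codebook: (K, d) learned concept embeddings
    Returns concept indices of shape (seq_len,).
    """
    # Squared Euclidean distance between every hidden state and every code.
    d2 = ((hidden[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dual_loss(token_logits, token_targets, concept_logits, concept_targets,
              concept_weight=1.0):
    """Sum of next-token (NTP) and next-concept (NCP) cross-entropies."""
    tp = softmax(token_logits)[np.arange(len(token_targets)), token_targets]
    cp = softmax(concept_logits)[np.arange(len(concept_targets)), concept_targets]
    return -np.log(tp).mean() + concept_weight * -np.log(cp).mean()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))        # 5 positions, hidden size d = 8
codebook = rng.normal(size=(16, 8))     # K = 16 concept codes
concepts = quantize(hidden, codebook)   # discrete concept id per position

token_logits = rng.normal(size=(5, 100))          # toy vocab of 100 tokens
token_targets = rng.integers(0, 100, size=5)
concept_logits = rng.normal(size=(5, 16))
loss = dual_loss(token_logits, token_targets, concept_logits, concepts)
print(concepts.shape, float(loss) > 0)
```

In a real transformer the codebook would be trained jointly (e.g., with a straight‑through estimator, as is typical for vector quantization), and the predicted concept would additionally be fed back to condition subsequent token generation.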
Results & Findings
- Benchmark performance – Across 13 diverse tasks (including GLUE, SuperGLUE, and zero‑shot generation benchmarks), ConceptLM outperformed token‑only baselines by 1–4 % absolute on average, with larger gains on tasks that benefit from phrase‑level understanding (e.g., entity recognition, commonsense reasoning).
- Scaling behavior – The relative improvement grows with model size and data volume; the 1.5 B‑parameter ConceptLM shows the biggest jump over its token‑only counterpart.
- Continual pre‑training – Adding an NCP stage to an already‑trained 8 B LLaMA model yields +0.8 % on average across the same benchmark suite, confirming that NCP can be used as a “boost” after the fact.
- Analysis of learned concepts – Visualizations reveal that many codebook entries correspond to semantically coherent units (named entities, technical terms, idioms), suggesting that the model is indeed capturing higher‑level structure.
Practical Implications
- Better few‑shot and zero‑shot performance – By internalizing multi‑token concepts, models can generalize from fewer examples, which is valuable for developers building applications with limited labeled data.
- More efficient prompting – The concept token can act as a concise “guide” for downstream generation, potentially reducing prompt length and improving controllability.
- Compatibility with existing pipelines – Since NCP is an additional loss term, teams can fine‑tune or continue‑pre‑train their current models without rebuilding the entire architecture.
- Potential for compression – The discrete concept vocabulary offers a natural way to compress model knowledge (e.g., storing only the codebook and concept predictions for downstream tasks).
- Enhanced interpretability – Concept tokens are human‑readable clusters, giving engineers a new lens to inspect what the model has learned (useful for debugging or bias analysis).
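One plausible way to exercise that interpretability, sketched below under assumptions: label a codebook entry by the vocabulary items whose embeddings lie closest to it. The toy vocabulary and embeddings are invented for illustration and are not from the paper.

```python
# Illustrative sketch: inspect what a learned concept code might represent by
# finding the nearest token embeddings. Vocabulary and embeddings are toy data.
import numpy as np

def nearest_tokens(code_vec, token_embeds, vocab, k=3):
    """Return the k vocabulary items whose embeddings are nearest to code_vec."""
    d2 = ((token_embeds - code_vec) ** 2).sum(axis=1)
    return [vocab[i] for i in np.argsort(d2)[:k]]

rng = np.random.default_rng(1)
vocab = ["new", "york", "city", "machine", "learning", "model"]
token_embeds = rng.normal(size=(len(vocab), 4))

# Pretend this code settled between "machine" and "learning" during training.
code = (token_embeds[3] + token_embeds[4]) / 2
print(nearest_tokens(code, token_embeds, vocab))
```

In practice one would run this over the whole codebook and surface the entries whose neighbors form semantically coherent clusters, mirroring the visualizations the authors report.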
Limitations & Future Work
- Concept granularity trade‑off – A too‑small codebook may force unrelated tokens into the same concept, while a too‑large codebook can dilute the benefit and increase memory overhead. Finding the sweet spot requires empirical tuning.
- Training overhead – The VQ layer adds extra computation and memory, modestly slowing pre‑training compared to pure token‑level models.
- Domain transfer – The learned concepts are tied to the pre‑training corpus; applying NCP to highly specialized domains (e.g., legal or biomedical) may need domain‑specific codebooks.
- Future directions suggested by the authors include: exploring hierarchical concept vocabularies, integrating NCP with retrieval‑augmented generation, and applying the paradigm to multimodal models (e.g., vision‑language).
Next Concept Prediction opens a practical path for developers to boost the semantic awareness of their language models without discarding existing investments. By treating multi‑token patterns as first‑class citizens during pre‑training, ConceptLM demonstrates that a modest change in objective can translate into measurable real‑world gains.
Authors
- Yuliang Liu
- Yunchong Song
- Yixuan Wang
- Kewen Ge
- Alex Lamb
- Qipeng Guo
- Kai Chen
- Bowen Zhou
- Zhouhan Lin
Paper Information
- arXiv ID: 2602.08984v1
- Categories: cs.CL, cs.AI
- Published: February 9, 2026