[Paper] Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models
Source: arXiv - 2602.08984v1
Overview
The paper introduces Next Concept Prediction (NCP), a pre‑training objective layered on top of the classic next‑token prediction (NTP) used by most large language models. Instead of predicting only the next token, the model also predicts a discrete “concept” that can span several tokens (e.g., a phrase, an entity, or a recurring pattern). By forcing the model to learn these higher‑level units, the authors show that language models become more expressive and achieve consistent gains across a wide range of downstream tasks.
Key Contributions
- Next Concept Prediction (NCP): a novel pre‑training task that predicts multi‑token concepts in addition to the usual next‑token objective.
- ConceptLM architecture: integrates vector‑quantized latent representations to build a compact “concept vocabulary” and uses the predicted concept to steer token generation.
- Scalable training: experiments from 70 M up to 1.5 B parameters, trained on up to 300 B tokens (including the Pythia and GPT‑2 data pipelines).
- Empirical gains: consistent improvements on 13 benchmark datasets (e.g., language understanding, reasoning, and generation tasks).
- Continual pre‑training proof‑point: applying NCP on top of an already‑trained 8 B‑parameter LLaMA model yields additional performance boosts, demonstrating compatibility with existing models.
Methodology
- Quantizing hidden states – The model’s continuous hidden vectors are passed through a vector‑quantization (VQ) layer, which maps each vector to the nearest entry in a learned codebook. Each codebook entry becomes a concept token.
- Building a concept vocabulary – By clustering similar hidden states across the training corpus, the VQ codebook captures recurring multi‑token patterns (e.g., “New York City”, “machine learning”, common idioms).
- Dual‑objective training – During pre‑training, the model simultaneously:
- Predicts the next token (standard NTP).
- Predicts the next concept token from the codebook (NCP).
The loss from both heads is summed, encouraging the network to learn both fine‑grained lexical knowledge and coarse‑grained semantic chunks.
- Guided token generation – At inference time, the predicted concept token is fed back into the decoder, providing a high‑level “hint” that conditions the subsequent token predictions.
The overall pipeline is simple enough to plug into existing transformer codebases: replace or augment the language modeling head with a VQ layer and an extra classification head for concepts.
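The core of that pipeline can be sketched in a few lines. The following is a minimal NumPy illustration of nearest‑codebook quantization and the summed dual loss, not the authors' implementation; the codebook size, dimensions, and equal loss weighting here are assumptions for demonstration only.

```python
# Minimal sketch of NCP's core pieces: (1) quantize hidden states against a
# concept codebook, (2) sum next-token and next-concept cross-entropy losses.
# All sizes and the loss weighting are illustrative assumptions.
import numpy as np

def quantize(hidden, codebook):
    """Map each hidden vector to the index of its nearest codebook entry.

    hidden:   (seq_len, d) continuous hidden states
    codebook: (K, d) learned concept embeddings
    Returns concept indices of shape (seq_len,).
    """
    # Squared Euclidean distance between every hidden state and every code.
    d2 = ((hidden[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dual_loss(token_logits, token_targets, concept_logits, concept_targets,
              concept_weight=1.0):
    """Sum of next-token (NTP) and next-concept (NCP) cross-entropies."""
    tp = softmax(token_logits)[np.arange(len(token_targets)), token_targets]
    cp = softmax(concept_logits)[np.arange(len(concept_targets)), concept_targets]
    return -np.log(tp).mean() + concept_weight * -np.log(cp).mean()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))        # 5 positions, hidden size d = 8
codebook = rng.normal(size=(16, 8))     # K = 16 concept codes
concepts = quantize(hidden, codebook)   # discrete concept id per position

token_logits = rng.normal(size=(5, 100))          # toy vocab of 100 tokens
token_targets = rng.integers(0, 100, size=5)
concept_logits = rng.normal(size=(5, 16))
loss = dual_loss(token_logits, token_targets, concept_logits, concepts)
print(concepts.shape, float(loss) > 0)
```

In a real transformer the codebook would be trained jointly (e.g., with a straight‑through estimator, as is typical for vector quantization), and the predicted concept would additionally be fed back to condition subsequent token generation.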
Results & Findings
- Benchmark performance – Across 13 diverse tasks (including GLUE, SuperGLUE, and zero‑shot generation benchmarks), ConceptLM outperformed token‑only baselines by 1–4 % absolute on average, with larger gains on tasks that benefit from phrase‑level understanding (e.g., entity recognition, commonsense reasoning).
- Scaling behavior – The relative improvement grows with model size and data volume; the 1.5 B‑parameter ConceptLM shows the biggest jump over its token‑only counterpart.
- Continual pre‑training – Adding an NCP stage to an already‑trained 8 B LLaMA model yields +0.8 % on average across the same benchmark suite, confirming that NCP can be used as a “boost” after the fact.
- Analysis of learned concepts – Visualizations reveal that many codebook entries correspond to semantically coherent units (named entities, technical terms, idioms), suggesting that the model is indeed capturing higher‑level structure.
Practical Implications
- Better few‑shot and zero‑shot performance – By internalizing multi‑token concepts, models can generalize from fewer examples, which is valuable for developers building applications with limited labeled data.
- More efficient prompting – The concept token can act as a concise “guide” for downstream generation, potentially reducing prompt length and improving controllability.
- Compatibility with existing pipelines – Since NCP is an additional loss term, teams can fine‑tune or continue‑pre‑train their current models without rebuilding the entire architecture.
- Potential for compression – The discrete concept vocabulary offers a natural way to compress model knowledge (e.g., storing only the codebook and concept predictions for downstream tasks).
- Enhanced interpretability – Concept tokens are human‑readable clusters, giving engineers a new lens to inspect what the model has learned (useful for debugging or bias analysis).
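One plausible way to exercise that interpretability, sketched below under assumptions: label a codebook entry by the vocabulary items whose embeddings lie closest to it. The toy vocabulary and embeddings are invented for illustration and are not from the paper.

```python
# Illustrative sketch: inspect what a learned concept code might represent by
# finding the nearest token embeddings. Vocabulary and embeddings are toy data.
import numpy as np

def nearest_tokens(code_vec, token_embeds, vocab, k=3):
    """Return the k vocabulary items whose embeddings are nearest to code_vec."""
    d2 = ((token_embeds - code_vec) ** 2).sum(axis=1)
    return [vocab[i] for i in np.argsort(d2)[:k]]

rng = np.random.default_rng(1)
vocab = ["new", "york", "city", "machine", "learning", "model"]
token_embeds = rng.normal(size=(len(vocab), 4))

# Pretend this code settled between "machine" and "learning" during training.
code = (token_embeds[3] + token_embeds[4]) / 2
print(nearest_tokens(code, token_embeds, vocab))
```

In practice one would run this over the whole codebook and surface the entries whose neighbors form semantically coherent clusters, mirroring the visualizations the authors report.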
Limitations & Future Work
- Concept granularity trade‑off – A too‑small codebook may force unrelated tokens into the same concept, while a too‑large codebook can dilute the benefit and increase memory overhead. Finding the sweet spot requires empirical tuning.
- Training overhead – The VQ layer adds extra computation and memory, modestly slowing pre‑training compared to pure token‑level models.
- Domain transfer – The learned concepts are tied to the pre‑training corpus; applying NCP to highly specialized domains (e.g., legal or biomedical) may need domain‑specific codebooks.
- Future directions suggested by the authors include: exploring hierarchical concept vocabularies, integrating NCP with retrieval‑augmented generation, and applying the paradigm to multimodal models (e.g., vision‑language).
Next Concept Prediction opens a practical path for developers to boost the semantic awareness of their language models without discarding existing investments. By treating multi‑token patterns as first‑class citizens during pre‑training, ConceptLM demonstrates that a modest change in objective can translate into measurable real‑world gains.
Authors
- Yuliang Liu
- Yunchong Song
- Yixuan Wang
- Kewen Ge
- Alex Lamb
- Qipeng Guo
- Kai Chen
- Bowen Zhou
- Zhouhan Lin
Paper Information
- arXiv ID: 2602.08984v1
- Categories: cs.CL, cs.AI
- Published: February 9, 2026