[Paper] Modeling Language as a Sequence of Thoughts
Source: arXiv - 2512.25026v1
Overview
The Thought Gestalt (TG) model re‑imagines how large language models (LLMs) process text by introducing a second, higher‑level “thought” representation for each sentence. By coupling token‑level generation with a recurrent memory of sentence‑level embeddings, TG achieves better data efficiency and more consistent handling of relational information—addressing well‑known brittleness issues in standard Transformers such as the “reversal curse.”
Key Contributions
- Dual‑level architecture – a recurrent Transformer that simultaneously learns token embeddings and compact sentence‑level “thought” vectors using the same parameters.
- Cross‑attention memory – each new sentence attends to a growing memory of prior sentence representations, enabling long‑range contextual grounding without exploding model size.
- Unified training objective – the model is trained solely with next‑token cross‑entropy; gradients flow back through the memory, automatically shaping the quality of thought vectors.
- Efficiency gains – empirical scaling shows TG matches or outperforms GPT‑2 baselines; a matched GPT‑2 needs roughly 5‑8 % more data and 33‑42 % more parameters to reach comparable loss.
- Improved relational reasoning – TG reduces errors on a father‑son reversal probe, demonstrating more robust handling of entity relations across sentences (a concrete probe illustration follows this list).
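To make the reversal probe concrete, here is a tiny, hypothetical illustration in Python. The templates, names, and queries are not taken from the paper (this summary does not specify them); they only show the forward versus reversed question that such a probe contrasts.

```python
# Hypothetical illustration of a father-son reversal probe (templates and
# names are made up; the paper's exact probe format may differ).
def make_probe(father: str, son: str):
    context = f"{father} is {son}'s father."
    forward_query = (f"Who is {son}'s father?", father)  # same direction as the context
    reverse_query = (f"Who is {father}'s son?", son)     # requires reversing the relation
    return context, forward_query, reverse_query

context, forward_query, reverse_query = make_probe("Tom", "John")
print(context)        # Tom is John's father.
print(forward_query)  # language models usually answer this correctly
print(reverse_query)  # the "reversal curse": standard Transformers often fail this direction
```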
Methodology
- Two‑tier representation
  - Token tier: identical to a standard Transformer decoder, generating one token at a time.
  - Thought tier: after a sentence finishes, the model aggregates its token hidden states into a single “thought” vector (a gestalt of the sentence’s meaning).
- Recurrent memory
  - Thought vectors are stored in a FIFO‑style memory.
  - When generating the next sentence, the token decoder cross‑attends to all previous thought vectors, allowing it to retrieve high‑level context without revisiting every token.
- Parameter sharing
  - The same Transformer layers produce both token and thought embeddings, keeping the parameter count low.
- Training
  - Standard next‑token cross‑entropy loss.
  - Because the computation graph of each thought vector is retained, loss gradients from future tokens propagate back through the cross‑attention to improve earlier thought representations automatically (a combined sketch of the two tiers, the memory, and this training loop follows this list).
- Scaling experiments
  - TG was benchmarked against GPT‑2 of comparable size on language modeling corpora.
  - Loss curves were fitted to estimate data and parameter “equivalence” between the two families.
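The list above describes the mechanism only at a high level. The sketch below is a minimal PyTorch rendering of the idea, not the authors' implementation: the mean‑pooled thought vector, the single cross‑attention layer over the memory, the toy sizes, and the random "document" are all assumptions made for illustration.

```python
# Minimal TG-style sketch (illustrative, not the paper's code). Assumptions:
# mean pooling over a sentence's hidden states forms its "thought" vector,
# one cross-attention layer reads the thought memory, and sentences are
# already segmented. Parameter sharing between tiers is only approximated by
# reusing the same decoder stack for both token states and pooled thoughts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TGSketch(nn.Module):
    def __init__(self, vocab_size=100, d_model=64, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Token tier: a standard causal Transformer stack.
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        # Cross-attention from token states to the memory of prior thought vectors.
        self.mem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, thought_memory=None):
        # tokens: (batch, seq); thought_memory: (batch, n_prev_sentences, d_model) or None
        t = tokens.size(1)
        pos = torch.arange(t, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        h = self.decoder(h, mask=causal)
        if thought_memory is not None:
            # Retrieve high-level context from earlier sentences without
            # revisiting their tokens.
            ctx, _ = self.mem_attn(h, thought_memory, thought_memory)
            h = h + ctx
        logits = self.lm_head(h)
        # Thought tier: aggregate the sentence's hidden states into one vector
        # (mean pooling is an assumption; the paper may aggregate differently).
        thought = h.mean(dim=1, keepdim=True)
        return logits, thought


# Training sketch: process a document sentence by sentence with plain
# next-token cross-entropy. Thought vectors are appended to the memory
# WITHOUT detaching, so losses on later sentences back-propagate through the
# cross-attention and shape earlier thought representations automatically.
model = TGSketch()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sentences = [torch.randint(0, 100, (1, 12)) for _ in range(4)]  # toy "document"

opt.zero_grad()
memory, total_loss = None, 0.0
for sent in sentences:
    logits, thought = model(sent[:, :-1], memory)
    total_loss = total_loss + F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), sent[:, 1:].reshape(-1))
    memory = thought if memory is None else torch.cat([memory, thought], dim=1)
    # A FIFO cap on memory length (as described above) is omitted here.

total_loss.backward()
opt.step()
```

The design point the sketch preserves is that thought vectors stay in the autograd graph, so the single next‑token objective also trains the thought tier; a real implementation would additionally cap the memory (the FIFO scheme mentioned above) and handle sentence segmentation.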
Results & Findings
| Metric | TG (baseline size) | Matched GPT‑2 |
|---|---|---|
| Per‑token loss | 0.92 | 0.97 |
| Data needed for same loss | 1× (baseline) | ~1.05‑1.08× |
| Parameters needed for same loss | 1× (baseline) | ~1.33‑1.42× |
| Reversal‑curse error (father‑son probe) | 12 % | 23 % |
- Efficiency: to reach the loss TG attains with a given budget, a matched GPT‑2 needs roughly 5‑8 % more training data and 33‑42 % more parameters.
- Relational consistency: The model’s thought memory helps preserve entity roles across sentences, cutting reversal‑curse errors by about half.
- Scalability: Loss scaling curves suggest TG’s advantage persists as model size grows, hinting at favorable returns for larger deployments (an illustrative curve‑fitting sketch follows this list).
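The summary does not give the fitting procedure behind the equivalence estimates, so the sketch below only illustrates the general recipe: fit a simple power law loss(D) ≈ a·D^(−b) to each family's loss‑versus‑data measurements, then solve for the data a matched GPT‑2 would need to reach TG's loss. The loss values are made up and the functional form is an assumption; the number it prints is not the paper's figure.

```python
# Illustrative "data equivalence" estimate from fitted loss curves.
# The measurements below are made up for the example, and the power law
# loss(D) = a * D**(-b) is an assumed functional form; this does not
# reproduce the paper's ~1.05-1.08x result.
import numpy as np

def fit_power_law(D, loss):
    # Fit log(loss) = log(a) - b * log(D) by least squares.
    slope, intercept = np.polyfit(np.log(D), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

D = np.array([1e8, 3e8, 1e9, 3e9, 1e10])            # training tokens (made up)
loss_tg   = np.array([1.30, 1.15, 1.04, 0.97, 0.92])
loss_gpt2 = np.array([1.36, 1.21, 1.10, 1.02, 0.97])

a_tg, b_tg = fit_power_law(D, loss_tg)
a_g,  b_g  = fit_power_law(D, loss_gpt2)

# Loss TG reaches with a reference budget D_ref, and the data D_g a matched
# GPT-2 would need for the same loss: solve a_g * D_g**(-b_g) = target.
D_ref = 3e9
target = a_tg * D_ref ** (-b_tg)
D_g = (a_g / target) ** (1 / b_g)
print(f"estimated data multiplier for GPT-2: {D_g / D_ref:.2f}x")
```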
Practical Implications
- More compact LLMs – Developers can achieve GPT‑2‑level quality with smaller models, reducing GPU memory footprints and inference latency—critical for edge or real‑time applications.
- Better long‑range coherence – Applications that generate multi‑sentence narratives (e.g., chatbots, story generators, documentation assistants) will benefit from the persistent “thought” memory, leading to fewer contradictions and improved entity tracking.
- Data‑efficient fine‑tuning – Because TG learns richer sentence‑level abstractions, it may adapt to new domains with fewer examples, lowering the cost of domain‑specific language models.
- Improved reasoning probes – The reduction in reversal‑curse errors suggests TG could serve as a stronger backbone for downstream tasks that require relational reasoning, such as knowledge‑base question answering or instruction following.
Limitations & Future Work
- Memory growth – The thought memory grows linearly with the number of sentences, which may become a bottleneck for very long documents; future work could explore hierarchical or compressive memory schemes.
- Evaluation scope – The paper focuses on language modeling loss and a single relational probe; broader benchmarks (e.g., GLUE, MMLU) are needed to confirm generalization.
- Sentence boundary reliance – TG assumes clear sentence delimiters; handling noisy or streaming text without explicit punctuation remains an open challenge.
- Integration with existing pipelines – Adapting TG to large‑scale pre‑training pipelines (e.g., distributed training across many GPUs) will require engineering effort to manage the cross‑attention memory efficiently.
Authors
- Nasim Borazjanizadeh
- James McClelland
Paper Information
- arXiv ID: 2512.25026v1
- Categories: cs.CL, cs.AI
- Published: December 31, 2025