[Paper] Estimating Text Temperature

Published: January 5, 2026 at 01:09 PM EST
4 min read
Source: arXiv - 2601.02320v1

Overview

The paper introduces a simple yet powerful technique for inferring the “temperature” that was (or could have been) used to generate any piece of text, given a reference language model. By treating temperature as a hidden parameter and estimating it via maximum‑likelihood, the author shows we can measure how “random” or “deterministic” a text is—even for human‑written passages. This opens the door to quantitative analyses of writing style, model behavior, and dataset composition.

Key Contributions

  • Temperature‑estimation algorithm: A maximum‑likelihood procedure that recovers the temperature parameter for arbitrary text with respect to a chosen autoregressive LM.
  • Comprehensive evaluation: Benchmarks the estimator across a spectrum of small‑to‑medium LLMs (e.g., LLaMA‑2, Mistral, Qwen‑3) to identify which models provide the most reliable temperature signals.
  • Large‑scale corpus analysis: Applies the best‑performing model (Qwen‑3 14B) to estimate temperatures for several well‑known corpora (Wikipedia, Reddit, news articles, literary works, etc.).
  • Open‑source tooling: Releases the estimation code and scripts, enabling the community to plug in any compatible transformer model.

Methodology

  1. Problem framing – Temperature \(T\) scales the logits of a language model before the softmax step:
    \[ p_i(T) = \frac{\exp\left(z_i / T\right)}{\sum_j \exp\left(z_j / T\right)} \]
    where \(z_i\) are the raw logits. The goal is to find the \(T\) that maximizes the likelihood of a given token sequence under a fixed model.

  2. Maximum‑likelihood estimation (MLE) – For a text \(x = (x_1,\dots,x_n)\), the log‑likelihood as a function of \(T\) is:
    \[ \mathcal{L}(T) = \sum_{t=1}^{n} \log p_{x_t}\bigl(T \mid x_{<t}\bigr) \]
    The estimator searches for \(\hat{T} = \arg\max_T \mathcal{L}(T)\) using a bounded scalar optimizer (e.g., Brent’s method) over a sensible range (e.g., \(0.1 \le T \le 5\)); a minimal code sketch follows this list.

  3. Model selection – The author runs the estimator on synthetic texts generated at known temperatures (0.5, 1.0, 1.5, …) for each candidate LM. The model whose estimated temperatures most closely match the ground truth (lowest mean absolute error) is deemed the most “temperature‑sensitive”; a second sketch below illustrates this calibration.

  4. Corpus‑level analysis – With the chosen model (Qwen‑3 14B), the estimator processes each document in a target corpus, aggregates the per‑document \(\hat{T}\) values, and reports distribution statistics (mean, median, variance).
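
To make the procedure concrete, here is a minimal Python sketch of steps 1–2, assuming a Hugging Face causal LM and SciPy’s bounded scalar optimizer. The function name `estimate_temperature` and the checkpoint id are illustrative placeholders, not the author’s released code:

```python
# Minimal sketch of the temperature MLE (not the paper's released code).
import torch
from scipy.optimize import minimize_scalar
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-14B"  # placeholder id; any compatible causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def estimate_temperature(text: str, t_min: float = 0.1, t_max: float = 5.0) -> float:
    """Return the T that maximizes the log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    # A single forward pass yields the logits z for every next-token position;
    # the logits do not depend on T, so the optimizer loop below is cheap.
    logits = model(ids).logits[0, :-1].float()  # (n - 1, vocab)
    targets = ids[0, 1:]                        # the tokens actually observed

    def neg_log_likelihood(T: float) -> float:
        log_probs = torch.log_softmax(logits / T, dim=-1)
        return -log_probs[torch.arange(targets.numel()), targets].sum().item()

    # Bounded scalar optimization (Brent-style) over a sensible temperature range.
    result = minimize_scalar(neg_log_likelihood, bounds=(t_min, t_max), method="bounded")
    return float(result.x)
```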

The entire pipeline is lightweight: it only requires forward passes through the LM and a scalar optimization per document, making it feasible for millions of sentences on a single GPU.
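
Step 3 can then be a short calibration loop: sample text at known temperatures, re‑estimate, and score each candidate model by mean absolute error. A hedged sketch reusing `estimate_temperature` from above; the prompt, temperatures, and generation settings are placeholders rather than the paper’s exact protocol:

```python
# Calibration sketch for model selection: lower MAE means the model's logits
# carry a clearer temperature signal. All settings here are illustrative.
known_temperatures = [0.5, 1.0, 1.5, 2.0]
prompt_ids = tok("The history of science", return_tensors="pt").input_ids

errors = []
for true_t in known_temperatures:
    sample = model.generate(
        prompt_ids,
        do_sample=True,
        temperature=true_t,
        top_k=0,               # pure temperature sampling, no top-k truncation
        max_new_tokens=256,
    )
    text = tok.decode(sample[0], skip_special_tokens=True)
    errors.append(abs(estimate_temperature(text) - true_t))

mae = sum(errors) / len(errors)
print(f"MAE: {mae:.2f}")
```

The corpus‑level analysis in step 4 is then just a map over documents: run `estimate_temperature` on each one and summarize the resulting \(\hat{T}\) values with the usual statistics (mean, median, variance).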

Results & Findings

| Model (size)   | MAE on synthetic texts (known \(T\)) | Preferred range |
|----------------|--------------------------------------|-----------------|
| Qwen‑3 14B     | 0.07                                 | 0.2 – 3.0       |
| LLaMA‑2 13B    | 0.12                                 | 0.3 – 4.0       |
| Mistral‑7B     | 0.15                                 | 0.4 – 5.0       |
| TinyLlama 1.1B | 0.23                                 | 0.5 – 6.0       |

  • Qwen‑3 14B consistently produced the smallest error, indicating that its logits retain a clear temperature signal.
  • When applied to real corpora, the estimated temperature distributions revealed intuitive patterns:
    • Wikipedia – low temperatures (median ≈ 0.45), reflecting highly predictable, factual prose.
    • Reddit comments – higher temperatures (median ≈ 1.2), matching the informal, varied style.
    • Literary novels – a bimodal shape (peaks around 0.6 and 1.4), suggesting a mix of narrative exposition and creative dialogue.
  • The estimator also distinguished human‑written vs. model‑generated text: synthetic samples at \(T = 1.0\) were reliably identified, while human text clustered around lower temperatures but with a broader spread.

Practical Implications

  • Dataset curation: Developers can automatically flag overly deterministic or overly noisy samples, helping balance training data for fine‑tuning LLMs (a sketch follows this list).
  • Model debugging: If a deployed model’s outputs drift toward unexpectedly high or low temperatures, the estimator can surface the shift before users notice quality degradation.
  • Style transfer & controllable generation: By measuring the temperature of a target style (e.g., news vs. chat), developers can set an appropriate temperature at inference to mimic that style more faithfully.
  • Human‑vs‑AI detection: Temperature estimates add a quantitative feature to classifiers that aim to detect AI‑generated content, complementing perplexity‑based signals.
  • Evaluation benchmarking: Researchers can report the “effective temperature” of benchmark datasets, making comparisons across papers more transparent.
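
As an illustration of the dataset‑curation use case, a filter might flag documents whose estimated temperature falls outside a target band. The thresholds below are arbitrary placeholders, not values from the paper:

```python
# Hypothetical curation filter built on estimate_temperature (defined earlier).
def flag_outliers(docs, t_low=0.3, t_high=2.0):
    """Return (document, estimated T) pairs outside the [t_low, t_high] band."""
    flagged = []
    for doc in docs:
        t_hat = estimate_temperature(doc)
        if not t_low <= t_hat <= t_high:
            flagged.append((doc, t_hat))
    return flagged
```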

Limitations & Future Work

  • Model dependency: Temperature estimates are only as reliable as the reference LM; a model that under‑fits the data may produce biased \(\hat{T}\) values.
  • Single‑parameter assumption: Real text may exhibit non‑uniform randomness across sections (e.g., dialogue vs. exposition); a single global temperature may oversimplify such heterogeneity.
  • Computational cost at scale: While lightweight per document, processing massive corpora still requires GPU resources; future work could explore amortized or batch‑wise estimation.
  • Extension to other decoding knobs: The paper focuses on temperature; extending the framework to top‑\(k\), nucleus (top‑\(p\)) sampling, or repetition penalties would broaden its applicability.

The author suggests exploring temperature‑aware fine‑tuning (training models to adapt their internal temperature dynamically) and cross‑model calibration to make temperature estimates comparable across different architectures.

Authors

  • Nikolay Mikhaylovskiy

Paper Information

  • arXiv ID: 2601.02320v1
  • Categories: cs.CL
  • Published: January 5, 2026