[Paper] Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models
Source: arXiv - 2512.13618v1
Overview
The paper Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models investigates how best to feed time information into LLMs that are fine‑tuned on event‑stream data (e.g., logs, sensor readings, user actions). By systematically comparing five different ways of turning timestamps into tokens, the authors show that the “right” representation depends on the statistical shape of the underlying time gaps, rather than there being a one‑size‑fits‑all solution.
Key Contributions
- First large‑scale empirical comparison of temporal tokenization methods for LLM‑based sequence prediction.
- Five distinct encodings evaluated (a minimal illustrative sketch follows this list):
  - Naïve numeric strings (e.g., "1623456789").
  - High‑precision byte‑level representations (binary‑packed scalars).
  - Human‑semantic calendar tokens (e.g., "Mon 09:45").
  - Uniform binning (fixed‑width time buckets).
  - Adaptive residual scalar quantization (dynamic bins + residual bits).
- Dataset suite covering diverse temporal distributions: smooth log‑normal inter‑arrival times, heavy‑tailed spikes, periodic calendar‑driven patterns, and mixed‑modality streams.
- Guidelines for matching tokenization to data characteristics, highlighting when log‑based encodings or human‑readable tokens outperform others.
- Open‑source benchmark code and tokenizers to enable reproducibility and rapid experimentation.
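To make the encodings concrete, the sketch below shows how a single Unix timestamp could be rendered under the first four schemes (adaptive residual quantization is sketched separately after the Methodology section). This is a minimal illustration under assumed conventions: the function names, the `<...>` token formats, the UTC conversion, and the 5‑minute bin width are our own choices, not the paper's released tokenizers.

```python
import struct
from datetime import datetime, timezone

def numeric_string_tokens(ts: float) -> str:
    # Naive numeric encoding: the raw Unix timestamp as decimal text.
    return str(int(ts))

def byte_level_tokens(ts: float) -> list[int]:
    # Byte-level encoding: pack the timestamp as a little-endian
    # 64-bit IEEE-754 float and emit one token per byte (0-255).
    return list(struct.pack("<d", ts))

def calendar_tokens(ts: float) -> list[str]:
    # Human-semantic encoding: discrete weekday / hour / meridiem tokens.
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    weekday = dt.strftime("%a").upper()           # e.g. "MON"
    meridiem = "AM" if dt.hour < 12 else "PM"
    return [f"<{weekday}>", f"<{dt.hour:02d}:00>", f"<{meridiem}>"]

def uniform_bin_token(ts: float, origin: float, bin_seconds: float = 300.0) -> str:
    # Uniform binning: index of a fixed-width (here 5-minute) bucket.
    return f"<BIN_{int((ts - origin) // bin_seconds)}>"

if __name__ == "__main__":
    t = 1623456789.0
    print(numeric_string_tokens(t))
    print(byte_level_tokens(t))
    print(calendar_tokens(t))
    print(uniform_bin_token(t, origin=1623456000.0))
```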
Methodology
- Data preparation – The authors curated four real‑world event streams (e‑commerce click logs, IoT sensor alerts, system audit trails, and calendar‑driven meeting records). Each dataset was annotated with precise timestamps and split into training/validation/test folds.
- Tokenization pipelines – For each of the five strategies, timestamps were transformed into token sequences compatible with the base LLM (a GPT‑NeoX‑style model with a 30k‑token vocabulary):
  - Numeric strings were simply cast to decimal text.
  - Byte‑level used little‑endian 64‑bit IEEE‑754 floats, then split into individual bytes.
  - Calendar tokens mapped timestamps to discrete tokens like "<MON>", "<09:00>", "<PM>".
  - Uniform binning divided the timeline into equal‑width intervals (e.g., 5‑minute bins) and replaced each timestamp with its bin index.
  - Adaptive residual quantization first selected a coarse bin via k‑means on inter‑arrival times, then encoded the residual with a small fixed‑point suffix (a hedged sketch of this step follows the Methodology list).
- Fine‑tuning – All tokenized streams were used to fine‑tune the same LLM architecture (12‑layer decoder, 768‑dim hidden size) for next‑event prediction. Training hyper‑parameters were held constant across experiments to isolate the effect of tokenization.
- Evaluation metrics – Predictive accuracy (top‑1/5), negative log‑likelihood, and calibration error were reported. Additionally, token‑level efficiency (average tokens per event) and inference latency were measured.
- Statistical analysis – Paired bootstrap tests assessed significance, while correlation analyses linked distribution skewness/kurtosis to the relative performance of each encoding.
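Below is a minimal sketch of how such an adaptive residual quantizer could look, assuming k‑means coarse bins fitted over inter‑arrival gaps and a 4‑bit fixed‑point residual code. The helper names (`fit_coarse_bins`, `encode_gap`), the 32‑bin count, and the residual clipping range are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_coarse_bins(gaps: np.ndarray, n_bins: int = 32, seed: int = 0) -> np.ndarray:
    """Fit coarse bin centers with k-means over observed inter-arrival gaps (seconds)."""
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=seed)
    km.fit(gaps.reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())

def encode_gap(gap: float, centers: np.ndarray,
               residual_bits: int = 4, residual_range: float = 60.0):
    """Encode one inter-arrival gap as (coarse bin token, residual token)."""
    bin_idx = int(np.argmin(np.abs(centers - gap)))   # nearest coarse bin
    residual = gap - centers[bin_idx]                 # signed leftover
    # Quantize the residual to a small fixed-point code within [-range, +range].
    levels = 2 ** residual_bits
    clipped = np.clip(residual, -residual_range, residual_range)
    code = int(round((clipped + residual_range) / (2 * residual_range) * (levels - 1)))
    return f"<T{bin_idx}>", f"<R{code}>"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gaps = rng.lognormal(mean=2.0, sigma=1.5, size=10_000)   # heavy-ish tailed gaps
    centers = fit_coarse_bins(gaps)
    print(encode_gap(137.2, centers))
```

The two-token output keeps sequences short on bursty data while still preserving fine-grained timing through the residual suffix, which is the trade-off the paper attributes to this encoding.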
Results & Findings
| Encoding | Best‑performing dataset | Accuracy Δ vs. baseline* | Tokens per event | Inference overhead |
|---|---|---|---|---|
| Numeric strings | Uniform‑bin dataset | +1.2 % | 12 | negligible |
| Byte‑level | High‑frequency IoT spikes | +3.8 % | 9 | +12 ms |
| Calendar tokens | Mixed‑modality calendar logs | +2.5 % | 8 | negligible |
| Uniform binning | Smooth log‑normal logs | +0.9 % | 6 | fastest |
| Adaptive residual quantization | Heavy‑tailed spiky data | +5.4 % | 7 | +5 ms |
*Baseline = naive numeric strings on the same dataset.
- No universal winner – Adaptive residual quantization shines on highly skewed, bursty streams, while human‑semantic calendar tokens are robust when the data contains periodic, human‑oriented patterns.
- Token efficiency matters – Strategies that compress timestamps into fewer tokens (uniform binning, calendar tokens) reduce latency without sacrificing accuracy on well‑behaved distributions.
- Alignment with distribution – A simple statistical check (e.g., skewness > 2) can predict when adaptive quantization will outperform simpler schemes.
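This guideline lends itself to a quick pre‑flight check. The sketch below is a toy decision rule built around the paper's "skewness > 2" heuristic; the function name `recommend_encoding` and the use of excess kurtosis as an extra diagnostic are our own illustrative additions, not the authors' code.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def recommend_encoding(timestamps: np.ndarray, skew_threshold: float = 2.0) -> str:
    """Toy decision rule: pick an encoding from the shape of the inter-arrival gaps."""
    gaps = np.diff(np.sort(timestamps))
    g_skew = skew(gaps)
    g_kurt = kurtosis(gaps)                       # excess kurtosis, extra diagnostic
    print(f"skewness={g_skew:.2f}, excess kurtosis={g_kurt:.2f}")
    if g_skew > skew_threshold:
        return "adaptive residual quantization"   # bursty / heavy-tailed streams
    return "uniform binning or calendar tokens"   # well-behaved / periodic streams

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    bursty = np.cumsum(rng.pareto(a=1.5, size=5_000))   # heavy-tailed synthetic gaps
    print(recommend_encoding(bursty))
```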
Practical Implications
- LLM‑powered log analytics – Engineers can swap in a calendar‑tokenizer for system logs that contain business‑hour patterns, gaining a modest accuracy bump without extra compute.
- Edge‑device forecasting – For IoT deployments with bursty sensor spikes, using byte‑level or adaptive residual encodings can improve prediction quality while keeping model size unchanged.
- Rapid prototyping – The open‑source tokenizers let developers experiment with a “plug‑and‑play” approach: run a quick distribution analysis on a new event stream, then select the matching encoding per the paper’s guidelines.
- Cost‑aware inference – Fewer tokens per event translate directly into lower API usage fees on hosted LLM services; uniform binning or calendar tokens are attractive when latency or cost is a primary concern.
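As a back‑of‑the‑envelope illustration, halving tokens per event roughly halves input‑token spend. The per‑token price and event volume below are made‑up assumptions; only the 12 vs. 6 tokens‑per‑event figures come from the results table above.

```python
# Illustrative cost comparison; price and volume are hypothetical, not from the paper.
events_per_month = 50_000_000
price_per_1k_tokens = 0.0005          # assumed hosted-LLM input price, USD

for name, tokens_per_event in [("numeric strings", 12), ("uniform binning", 6)]:
    monthly_cost = events_per_month * tokens_per_event / 1_000 * price_per_1k_tokens
    print(f"{name:>16}: {tokens_per_event} tok/event -> ${monthly_cost:,.0f}/month")
```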
Limitations & Future Work
- Model scale – Experiments were limited to a single 12‑layer decoder with a 768‑dimensional hidden size; results may shift with larger, instruction‑tuned LLMs.
- Single‑modal focus – The study only examined timestamp + categorical event payloads; multimodal streams (e.g., text + time) were not explored.
- Static tokenizers – All encodings were fixed after preprocessing; dynamic, context‑aware tokenization (e.g., learned embeddings for time) remains an open avenue.
- Real‑time adaptation – Future work could investigate online adjustment of quantization bins as the temporal distribution drifts in production environments.
Bottom line: Choosing the right temporal tokenization is as important as model architecture when building LLM‑driven event predictors. By matching the encoding to the data’s time‑distribution, developers can squeeze out measurable gains in accuracy, efficiency, and cost.
Authors
- Zefang Liu
- Nam Nguyen
- Yinzhu Quan
- Austin Zhang
Paper Information
- arXiv ID: 2512.13618v1
- Categories: cs.CL, cs.LG
- Published: December 15, 2025