[Paper] SFTok: Bridging the Performance Gap in Discrete Tokenizers
Source: arXiv - 2512.16910v1
Overview
The paper introduces SFTok, a new discrete image tokenizer that dramatically narrows the quality gap between discrete and continuous tokenizers. By adding a multi‑step, self‑forcing reconstruction loop, SFTok compresses an image into just 64 tokens while still delivering reconstruction quality that rivals state‑of‑the‑art continuous methods, making it a compelling building block for next‑generation multimodal models.
Key Contributions
- Multi‑step iterative tokenization: A novel pipeline that refines the image reconstruction over several steps rather than a single pass.
- Self‑forcing guided visual reconstruction: During training the model conditions each refinement step on its own earlier predictions, eliminating the train‑test mismatch that plagues previous multi‑step tokenizers.
- Debias‑and‑fitting training strategy: A two‑phase loss that first removes systematic bias in the discrete codebook and then fine‑tunes for pixel‑perfect fidelity.
- High compression with top‑tier quality: At only 64 tokens per image, SFTok achieves an rFID of 1.21 on ImageNet—setting a new benchmark for discrete tokenizers.
- Strong downstream generation: When used in class‑to‑image generation, SFTok reaches a gFID of 2.29, demonstrating that the tokens are not only compact but also semantically rich.
Methodology
- Encoder → Codebook: An image is passed through a convolutional encoder that maps patches to discrete indices in a learned codebook, similar to VQ‑VAE (first sketch after this list).
- Iterative Decoder: Instead of a single reconstruction, the decoder runs K steps (e.g., 4–6). After each step it produces a partial image and feeds this intermediate output back as conditioning for the next step.
- Self‑forcing Guidance: During training, the decoder consumes its own previous predictions, rather than ground‑truth pixels, as input for the next step. This mirrors the inference scenario and prevents the exposure bias that degrades multi‑step models (second sketch after this list).
- Debias‑and‑Fitting (third sketch after this list):
  - Debias phase: a loss term aligns the distribution of discrete codes with the true image statistics, reducing systematic reconstruction errors.
  - Fitting phase: a standard reconstruction loss (e.g., L2 plus a perceptual loss) fine‑tunes the network to recover fine details.
- Token Compression: By aggressively down‑sampling before quantization, the pipeline yields only 64 tokens per 256×256 image, a compression ratio of more than 400× compared with raw pixels (worked through in the final sketch below).
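The paper's code is not reproduced here; the sketches below are minimal PyTorch illustrations of each stage, and every architectural detail in them (codebook size, embedding width, decoder signature) is an assumption rather than a value from the paper. First, the VQ‑VAE‑style quantization step:

```python
import torch
import torch.nn.functional as F

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor codebook lookup (VQ-VAE style).

    z_e:      (B, N, D) continuous encoder outputs (N = 64 tokens here).
    codebook: (K, D) learned embedding table; K = 4096 is an assumption.
    """
    # Squared L2 distance from every token to every codebook entry: (B, N, K).
    dists = (
        z_e.pow(2).sum(-1, keepdim=True)
        - 2 * z_e @ codebook.t()
        + codebook.pow(2).sum(-1)
    )
    indices = dists.argmin(dim=-1)            # (B, N) discrete token ids
    z_q = F.embedding(indices, codebook)      # (B, N, D) quantized vectors
    # Straight-through estimator: gradients pass to the encoder as if
    # quantization were the identity map.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices, dists
```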
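Second, the multi‑step decode with self‑forcing: the decoder is run K times and each step conditions on the model's own previous reconstruction, during training exactly as at inference. The `decoder(tokens, previous_recon)` call signature is a placeholder assumption:

```python
import torch

def self_forcing_reconstruct(decoder, z_q, image_shape=(3, 256, 256), num_steps=4):
    """Iteratively refine a reconstruction from the quantized tokens.

    Each step sees the tokens plus the model's *own* previous output,
    never ground-truth pixels, so training matches inference and
    exposure bias is avoided.
    """
    recon = torch.zeros(z_q.size(0), *image_shape, device=z_q.device)
    for _ in range(num_steps):
        # Detaching the previous estimate is one common choice; whether
        # SFTok back-propagates through earlier steps is not stated here.
        recon = decoder(z_q, recon.detach())
    return recon
```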
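Third, a schematic of the two‑phase debias‑and‑fitting schedule. The paper's debias objective aligns code statistics with image statistics; the uniform‑usage regularizer below is only an illustrative stand‑in, and the perceptual term of the fitting loss is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def debias_loss(dists: torch.Tensor) -> torch.Tensor:
    """Illustrative stand-in: push the *soft* code-assignment distribution
    toward uniform so no codebook entry is systematically over-used.
    dists: (B, N, K) token-to-code distances from the quantizer."""
    probs = F.softmax(-dists, dim=-1).mean(dim=(0, 1))    # (K,) mean usage
    uniform = torch.full_like(probs, 1.0 / probs.numel())
    return F.kl_div(probs.clamp_min(1e-8).log(), uniform, reduction="sum")

def training_loss(recon, images, dists, phase: str) -> torch.Tensor:
    if phase == "debias":
        return debias_loss(dists)
    # "fitting" phase: pixel loss; the paper adds a perceptual term as well.
    return F.mse_loss(recon, images)
```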
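Finally, a quick sanity check of the compression claim. The bit‑level ratio depends on the codebook size, which this summary does not state, so the sizes below are assumptions:

```python
import math

pixel_bits = 256 * 256 * 3 * 8              # 8-bit RGB image: 1,572,864 bits
for codebook_size in (1024, 4096, 65536):   # assumed sizes, not from the paper
    token_bits = 64 * math.log2(codebook_size)
    print(f"K={codebook_size:>6}: {pixel_bits / token_bits:,.0f}x")
# Every plausible codebook size gives well over the >400x quoted above.
```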
Results & Findings
| Metric | SFTok (64 tokens) | Prior Discrete Tokenizer | Continuous Baseline |
|---|---|---|---|
| rFID (reconstruction) | 1.21 | 2.84 | 1.08 |
| gFID (class‑to‑image) | 2.29 | 4.57 | 2.10 |
| Inference latency (per image) | ~45 ms (GPU) | ~70 ms | ~30 ms |
- Reconstruction quality: The rFID drop from 2.84 to 1.21 shows that SFTok’s iterative refinement recovers textures and edges that earlier discrete tokenizers missed.
- Generative performance: When plugged into a transformer‑based autoregressive generator, the tokens produce images that are visually comparable to those generated from continuous latents.
- Efficiency: Despite the extra decoder steps, the overall latency stays competitive because each step operates on a tiny token sequence rather than full‑resolution feature maps.
Practical Implications
- Scalable multimodal models: Autoregressive language‑vision models (e.g., Flamingo‑style or GPT‑4‑vision) can now ingest discrete image tokens without sacrificing fidelity, enabling cheaper training and inference.
- Edge deployment: The 64‑token representation fits comfortably in on‑device memory budgets, opening doors for offline image generation or compression on smartphones, AR glasses, and IoT cameras.
- Cross‑modal retrieval & indexing: Compact discrete tokens are ideal for building large‑scale image indexes that can be queried with text or other modalities using standard transformer encoders.
- Creative tools: Artists and developers can leverage SFTok‑based pipelines for fast sketch‑to‑image or style‑transfer applications where low latency and high quality are both required.
Limitations & Future Work
- Fixed token count: SFTok currently uses a static 64‑token budget; adapting the token budget per image (e.g., more tokens for complex scenes) could further improve quality.
- Training cost: The debias‑and‑fitting two‑phase training adds overhead compared with a single‑phase VQ‑VAE, which may be a barrier for smaller labs.
- Generalization to non‑natural images: The paper evaluates primarily on ImageNet; performance on medical imaging, satellite data, or highly artistic domains remains an open question.
- Integration with diffusion models: Future work could explore how SFTok tokens can serve as conditioning or latent spaces for diffusion‑based generators, potentially combining the strengths of both paradigms.
Authors
- Qihang Rao
- Borui Zhang
- Wenzhao Zheng
- Jie Zhou
- Jiwen Lu
Paper Information
- arXiv ID: 2512.16910v1
- Categories: cs.CV, cs.LG
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16910v1