[Paper] Autoregressive Image Generation with Masked Bit Modeling

Published: February 9, 2026 at 01:59 PM EST
5 min read
Source: arXiv - 2602.09024v1

Overview

The paper “Autoregressive Image Generation with Masked Bit Modeling” shows that the long‑standing gap between discrete‑token‑based and continuous‑pixel‑based image generators isn’t a fundamental limitation of tokenizers. Instead, it’s mainly due to how many bits are used to represent the image in the latent space. By dramatically enlarging the codebook (i.e., the set of possible tokens) and training a new masked‑bit autoregressive model (BAR), the authors achieve state‑of‑the‑art image synthesis quality on ImageNet‑256 while cutting down sampling time and training cost.

Key Contributions

  • Bit‑level analysis of the discrete‑continuous gap – Demonstrates that compression ratio (total bits per image) is the primary factor behind the performance difference.
  • Scalable codebook design – Shows that expanding the codebook size (up to millions of tokens) lets discrete tokenizers match or exceed continuous models.
  • Masked Bit AutoRegressive (BAR) framework – Introduces a transformer that predicts each token bit‑by‑bit using a masked‑bit modeling head, enabling arbitrary‑size codebooks without exploding memory or compute.
  • State‑of‑the‑art results – Achieves a generative Fréchet Inception Distance (gFID) of 0.99 on ImageNet‑256, beating both leading continuous diffusion models and previous discrete generators.
  • Efficiency gains – BAR converges faster than comparable continuous pipelines and reduces sampling latency by a large margin (≈ 2‑3× faster than typical diffusion samplers).
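The "bits per image" framing behind the first contribution can be made concrete with a quick back-of-the-envelope calculation (the token-grid size here is illustrative, not a number from the paper):

```python
import math

def bits_per_image(num_tokens: int, codebook_size: int) -> int:
    """Total latent bits = tokens per image x bits per token."""
    bits_per_token = math.ceil(math.log2(codebook_size))
    return num_tokens * bits_per_token

# A hypothetical 16x16 token grid (256 tokens per image) at growing codebook sizes:
for k in (1024, 2**16, 2**20):
    print(f"codebook {k:>8}: {bits_per_image(256, k)} bits per image")
```

Growing the codebook from 1,024 to 2²⁰ entries doubles the latent bit budget per image (10 → 20 bits per token) without changing the number of tokens, which is exactly the knob the paper argues closes the gap with continuous models.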

Methodology

  1. Discrete Latent Representation

    • Images are first encoded by a VQ‑GAN‑style tokenizer that maps patches to discrete tokens.
    • Unlike prior work that uses a modest codebook (e.g., 1024 entries), BAR experiments with large codebooks (up to 2¹⁶ tokens), increasing the bits per token from 10 bits to 16 bits or more.
  2. Masked Bit Modeling

    • Each token is treated as a binary string. The transformer receives a masked version of the bit sequence (similar to BERT’s masked‑language modeling) and learns to predict the missing bits.
    • The model predicts bits progressively: early decoding steps fill in high‑order bits, later steps refine low‑order bits, which naturally aligns with the hierarchical nature of visual information.
  3. Autoregressive Generation

    • At inference time, BAR samples bits sequentially, reconstructing full tokens on the fly. Because bits are binary, the output vocabulary stays tiny (just {0,1}), keeping the softmax cheap even with a massive codebook.
  4. Training & Optimization

    • Standard cross‑entropy loss on masked bits.
    • Mixed‑precision training and gradient checkpointing keep GPU memory in check despite the huge codebook.
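The masked‑bit training target described in steps 1–2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: tokens are unpacked into fixed‑width binary strings, a random subset of bits is replaced with a MASK sentinel, and the hidden bits become the cross‑entropy targets.

```python
import random

def token_to_bits(token: int, width: int) -> list[int]:
    """Unpack a token id into its fixed-width binary representation (MSB first)."""
    return [(token >> (width - 1 - i)) & 1 for i in range(width)]

def mask_bits(bits: list[int], mask_ratio: float, rng: random.Random):
    """Hide a random subset of bits behind a MASK sentinel (-1).
    The model would be trained to predict the original bits at those positions."""
    masked = list(bits)
    targets = {}
    for i in range(len(bits)):
        if rng.random() < mask_ratio:
            targets[i] = bits[i]  # ground truth for the masked-bit loss
            masked[i] = -1        # MASK sentinel fed to the transformer
    return masked, targets

rng = random.Random(0)
bits = token_to_bits(token=43690, width=16)  # 43690 = 0b1010101010101010
masked, targets = mask_bits(bits, mask_ratio=0.5, rng=rng)
```

Because each target is a single bit, the prediction head is a binary classifier per position, regardless of how large the codebook grows.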

Results & Findings

| Metric | BAR (this work) | Best continuous diffusion | Prior discrete (e.g., VQ‑Transformer) |
| --- | --- | --- | --- |
| gFID (ImageNet‑256) | 0.99 | 1.12 | 1.45 |
| Sampling time (ms / 256×256 image) | ~120 | ~350 | ~200 |
| Training epochs to convergence | 200 | 400 | 300 |
  • Quality: The gFID of 0.99 surpasses both the best continuous diffusion models and prior discrete generators on the challenging 256×256 ImageNet benchmark.
  • Speed: By predicting bits rather than full tokens, BAR reduces the softmax dimension, leading to faster sampling without sacrificing diversity.
  • Scalability: Experiments confirm that increasing the codebook size continues to improve quality up to a point, after which returns diminish—validating the “bits matter” hypothesis.
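The speed argument above hinges on the tiny output vocabulary. A sketch of bitwise sequential sampling, where the model's prediction head is stood in for by an arbitrary probability function (`predict_p1` is a hypothetical placeholder, not the paper's API):

```python
import random

def sample_token(bit_width: int, predict_p1, rng: random.Random) -> int:
    """Sample one token bit by bit. `predict_p1(prefix)` stands in for the
    transformer head: it returns P(next bit = 1) given the bits drawn so far."""
    prefix: list[int] = []
    for _ in range(bit_width):
        p1 = predict_p1(prefix)              # 2-way "softmax": just one probability
        prefix.append(1 if rng.random() < p1 else 0)
    # Pack the sampled bits (MSB first) back into a token id.
    token = 0
    for b in prefix:
        token = (token << 1) | b
    return token

rng = random.Random(42)
# A stand-in head that always predicts bit = 1 yields the all-ones token.
token = sample_token(16, lambda prefix: 1.0, rng)
print(token)  # 65535
```

Each step is a Bernoulli draw rather than a softmax over the full codebook, so the per-step cost is independent of codebook size; the trade-off, noted in the limitations, is the extra sequential steps per token.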

Practical Implications

  • Faster Image Generation APIs – Developers can deploy BAR‑based services that deliver high‑quality images in real time, suitable for content creation, UI mock‑ups, or data augmentation pipelines.
  • Lower Compute Footprint – Because BAR uses a binary prediction head, it runs efficiently on commodity GPUs and even on edge accelerators that support int8 operations.
  • Plug‑and‑Play Tokenizer – The large codebook can be pre‑trained once and reused across downstream tasks (e.g., conditional generation, inpainting), similar to how language models share tokenizers.
  • Hybrid Workflows – Teams can combine BAR’s discrete latent generation with downstream continuous refinement (e.g., a lightweight diffusion step) to get the best of both worlds: speed + ultra‑fine detail.
  • Open‑Source Toolkit – The authors provide a public repo with pretrained models and a simple Python API, lowering the barrier for integration into existing ML stacks.

Limitations & Future Work

  • Codebook Size vs. Memory – While BAR mitigates the softmax cost, the underlying tokenizer still needs to store a massive embedding table, which can be memory‑intensive for very large codebooks.
  • Bit‑Masking Overhead – The progressive bit‑wise generation introduces extra sequential steps; further research could explore parallel bit decoding or learned bit ordering.
  • Generalization Beyond ImageNet – The paper focuses on ImageNet‑256; applying BAR to higher resolutions (e.g., 1024×1024) or other modalities (video, 3‑D) may require architectural tweaks.
  • Conditional Generation – Current experiments are unconditional; extending BAR to text‑to‑image or class‑conditional settings is an open avenue.

Overall, “Masked Bit AutoRegressive modeling” reshapes how we think about discrete image generation, offering a practical, high‑quality, and efficient alternative to diffusion‑based pipelines.

Authors

  • Qihang Yu
  • Qihao Liu
  • Ju He
  • Xinyang Zhang
  • Yang Liu
  • Liang-Chieh Chen
  • Xi Chen

Paper Information

  • arXiv ID: 2602.09024v1
  • Categories: cs.CV
  • Published: February 9, 2026
  • PDF: Download PDF