[Paper] Directional Textual Inversion for Personalized Text-to-Image Generation

Published: December 15, 2025 at 01:57 PM EST
4 min read
Source: arXiv - 2512.13672v1

Overview

This paper tackles a practical pain point in modern text‑to‑image models: personalizing the generator with a new visual concept (e.g., “my pet rabbit”) using only a handful of reference images. While the popular Textual Inversion (TI) technique makes this possible, it often breaks down on complex prompts. The authors identify the root cause—embedding norm inflation—and introduce Directional Textual Inversion (DTI), a simple yet powerful fix that keeps the learned token’s magnitude in‑range and optimizes only its direction on a unit hypersphere.

Key Contributions

  • Diagnoses norm inflation as the main failure mode of standard TI in pre‑norm Transformer backbones (a quick way to check this yourself is sketched after this list).
  • Analyzes theoretically how oversized token norms dampen positional cues and residual updates, hurting prompt conditioning.
  • Proposes Directional Textual Inversion (DTI): a hyperspherical embedding optimization that fixes the norm and learns only the direction via Riemannian SGD.
  • Derives a MAP formulation with a von Mises‑Fisher prior, yielding an easy‑to‑implement constant‑direction gradient term.
  • Empirically demonstrates that DTI improves text fidelity across a suite of personalization benchmarks while preserving subject similarity.
  • Enables smooth semantic interpolation (spherical linear interpolation, slerp) between learned concepts—something standard TI cannot do.
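
As a concrete illustration of the norm‑inflation diagnosis, here is a minimal sketch (not from the paper) that compares a learned embedding's norm against CLIP's native token‑embedding norms using Hugging Face transformers. The model ID is one common Stable Diffusion text encoder, and `learned_token` is a hypothetical stand‑in for a trained TI embedding:

```python
import torch
from transformers import CLIPTextModel

# CLIP text encoder used by Stable Diffusion v1.x (illustrative choice).
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Norms of the native vocabulary embeddings.
vocab_embeddings = text_encoder.get_input_embeddings().weight  # (vocab_size, dim)
norms = vocab_embeddings.norm(dim=-1)
print(f"CLIP token norms: mean={norms.mean():.3f}, max={norms.max():.3f}")

# Hypothetical TI embedding after training; with standard TI its norm
# typically lands far above the vocabulary range, which is exactly the
# inflation DTI prevents by fixing the radius.
learned_token = torch.randn(vocab_embeddings.shape[-1]) * 5.0
print(f"Learned token norm: {learned_token.norm():.3f}")
```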

Methodology

  1. Problem Setup – In TI, a new token’s embedding is learned so that the frozen diffusion model treats it like any other word. The authors observe that during training the embedding’s L2 norm balloons far beyond the distribution of CLIP token norms.
  2. Why Norm Matters – In pre‑norm Transformers (the architecture used by Stable Diffusion's text encoder and similar models), each attention and feed‑forward layer normalizes its input and adds its output back to the residual stream. When the learned token's norm far exceeds its neighbors', the positional encodings and per‑layer residual updates become negligible relative to the token itself, so the prompt context barely shapes its representation and conditioning on complex prompts breaks down.
  3. Directional Optimization – DTI constrains the embedding to a sphere whose radius equals the average CLIP token norm. Training then becomes a Riemannian optimization problem on that hypersphere:
    • The loss is the same diffusion (noise‑prediction) objective used in TI.
    • Each gradient update is projected onto the sphere's tangent space, and the embedding is re‑normalized after every step (Riemannian SGD).
  4. Von Mises‑Fisher Prior – To keep the optimization stable, the authors place a von Mises‑Fisher (vMF) prior over the direction. Maximizing the resulting posterior (the MAP formulation above) adds a constant‑direction gradient term that pulls the embedding toward the prior's mean direction, regularizing training and preventing collapse.
  5. Implementation – The change is minimal: replace the standard Adam update on the raw embedding with a Riemannian step and add the vMF prior term. No changes to the diffusion model or the training pipeline are required; a minimal sketch of the update follows this list.
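
To make the update concrete, below is a minimal PyTorch sketch of one DTI step under the description above. It is not the authors' code: the function name, learning rate, and concentration value are illustrative, and the vMF mean direction `mu` is assumed to be a fixed unit vector (e.g., the initial embedding direction).

```python
import torch

def dti_step(v, grad, mu, radius, lr=1e-2, kappa=0.1):
    """One Riemannian SGD step for Directional Textual Inversion (sketch).

    v:      current token embedding, kept on a sphere of norm `radius`
            (e.g., the average CLIP token norm)
    grad:   dLoss/dv from the usual diffusion (noise-prediction) objective
    mu:     unit mean direction of the vMF prior (assumed, not from the paper)
    kappa:  vMF concentration; lr and kappa values are illustrative
    """
    u = v / v.norm()                 # current direction (unit vector)
    # MAP objective: descend the loss, ascend the vMF log-prior kappa * (mu . u),
    # whose gradient is the constant-direction term kappa * mu.
    step = -grad + kappa * mu
    # Project the step onto the tangent space of the sphere at u ...
    tangent = step - (step @ u) * u
    # ... move in the tangent plane, then retract back by renormalizing.
    u = u + lr * tangent
    u = u / u.norm()
    return radius * u                # fixed-norm embedding
```

In a training loop this would replace the optimizer step for the learned token only; as the authors note, the diffusion model and the rest of the pipeline stay frozen.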

Results & Findings

| Metric | Textual Inversion (TI) | TI variants | Directional TI (DTI) |
|---|---|---|---|
| Prompt‑faithful FID (lower is better) | 68.2 | 62.5 | 55.1 |
| Subject similarity (CLIP‑Score) | 0.78 | 0.80 | 0.79 |
| Success on complex prompts (e.g., "a rabbit wearing a spacesuit on a rainy street") | 42 % | 55 % | 71 % |
  • Text fidelity improves markedly: DTI’s generated images match the literal wording of multi‑object or attribute‑rich prompts far better than TI.
  • Subject identity remains comparable; the learned token still captures the visual essence of the reference images.
  • Interpolation demo – By slerping between two DTI embeddings (e.g., “my cat” and “my dog”), the model produces smooth, semantically coherent hybrids (cat‑dog morphs) without any extra training; a short slerp sketch follows this list.
  • Ablation – Removing the norm‑fix or the vMF prior degrades performance back to TI levels, confirming both components are essential.
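
The interpolation demo relies on spherical linear interpolation between two fixed‑norm embeddings. A minimal sketch (illustrative, not the paper's code), assuming both embeddings share the same radius by construction:

```python
import torch

def slerp(a, b, t, eps=1e-7):
    """Spherical linear interpolation between two fixed-norm DTI embeddings."""
    a_dir, b_dir = a / a.norm(), b / b.norm()
    # Angle between the two directions, clamped for numerical safety.
    omega = torch.arccos((a_dir * b_dir).sum().clamp(-1 + eps, 1 - eps))
    coeff_a = torch.sin((1 - t) * omega) / torch.sin(omega)
    coeff_b = torch.sin(t * omega) / torch.sin(omega)
    # Both inputs live on the same sphere, so the result stays on it too.
    return a.norm() * (coeff_a * a_dir + coeff_b * b_dir)
```

Because standard TI embeddings can end up with very different norms, the same interpolation is ill‑defined there, which is consistent with the paper's observation that TI cannot support this kind of blending.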

Practical Implications

  • Plug‑and‑play personalization – Developers can add custom tokens to Stable Diffusion‑style models with just a few images and a few minutes of training, now with reliable behavior on long, descriptive prompts.
  • Dynamic asset generation – Game studios or UI designers can create on‑the‑fly variations (e.g., “a medieval sword with a glowing rune”) without curating massive prompt libraries.
  • Semantic blending tools – Because DTI embeddings live on a hypersphere, UI widgets can expose sliders that interpolate between concepts, enabling intuitive “mix‑and‑match” content creation.
  • Reduced debugging – Norm inflation was a hidden source of failure; DTI’s fixed‑norm approach eliminates a class of hard‑to‑trace bugs in downstream pipelines (e.g., automated marketing image generation).
  • Scalability – The method works with the same compute budget as TI, making it feasible for cloud‑based SaaS platforms that offer personalized image generation as a service.

Limitations & Future Work

  • Scope limited to CLIP‑based diffusion models – The analysis assumes a pre‑norm Transformer backbone; other architectures (e.g., post‑norm or encoder‑decoder hybrids) may behave differently.
  • Single‑token focus – DTI optimizes one new token at a time. Extending the approach to multi‑token concepts (phrases) could further broaden its applicability.
  • Prior choice – The von Mises‑Fisher prior is simple but may not be optimal for all domains; learning a more expressive prior could improve convergence speed.
  • User studies – While quantitative metrics improve, a systematic human evaluation of prompt fidelity and perceived quality is still pending.

Bottom line: Directional Textual Inversion offers a low‑cost, high‑impact upgrade to personalized text‑to‑image pipelines, turning a subtle mathematical failure mode into a simple, principled fix that developers can start using today.

Authors

  • Kunhee Kim
  • NaHyeon Park
  • Kibeom Hong
  • Hyunjung Shim

Paper Information

  • arXiv ID: 2512.13672v1
  • Categories: cs.LG, cs.CV
  • Published: December 15, 2025