Measuring What Matters: Objective Metrics for Image Generation Assessment
Source: Dev.to
Introduction
Generating high‑quality visuals with state‑of‑the‑art models is becoming increasingly accessible. Open‑source models run on laptops, and cloud services turn text into images in seconds. These models are already reshaping industries like advertising, gaming, fashion, and science.
But creating images is the easy part. Judging their quality is much harder. Human feedback is slow, expensive, biased, and often inconsistent. Moreover, quality has many facets: creativity, realism, and style don’t always align. Improving one can hurt another.
That’s why we need clear, objective metrics that capture quality, coherence, and originality. Below we explore methods for evaluating image quality and comparing models with Pruna, beyond simply asking “does it look cool?”.
Metrics Overview
There is no single correct way to categorize evaluation metrics, as a metric can belong to multiple categories depending on its usage and the data it evaluates. In our repository, all quality metrics can be computed in two modes:
- Single mode – evaluates a model by comparing the generated images to input references or ground‑truth images, producing one score per model.
- Pairwise mode – compares two models by directly evaluating the generated images from each model together, producing a single comparative score for the two models.
This flexibility enables both absolute evaluations (assessing each model individually) and relative evaluations (direct comparisons between models).
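As a conceptual sketch (the function and argument names below are hypothetical, not Pruna's actual API), the two modes differ only in what the metric is asked to compare:

```python
# Conceptual sketch only: `metric` stands in for any quality metric below;
# none of these names are taken from Pruna's API.
from typing import Callable, Sequence

def single_mode(metric: Callable, generated: Sequence, references: Sequence) -> float:
    """Single mode: one score per model, comparing its images to references."""
    return metric(generated, references)

def pairwise_mode(metric: Callable, generated_a: Sequence, generated_b: Sequence) -> float:
    """Pairwise mode: one comparative score for the outputs of two models."""
    return metric(generated_a, generated_b)
```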
On top of the evaluation modes, it also makes sense to think about metrics in terms of their evaluation criteria. Our metrics fall into two overarching categories:
- Efficiency Metrics – measure speed, memory usage, carbon emissions, energy, etc., during inference. (We omit a detailed discussion here; see our documentation for more.)
- Quality Metrics – measure generated images’ intrinsic quality and alignment to intended prompts or references. These include:
- Distribution Alignment – how closely generated images resemble real‑world distributions.
- Prompt Alignment – semantic similarity between generated images and their intended prompts.
- Perceptual Alignment – pixel‑level or perceptual similarity between generated and reference images.
Quality Metrics Summary
| Metric | Measures | Category | Range (↑ higher is better / ↓ lower is better) | Limitations |
|---|---|---|---|---|
| FID | Distributional similarity to real images | Distribution Alignment | 0 → ∞ (↓) | Assumes Gaussianity, requires a large dataset, depends on a surrogate model |
| CMMD | CLIP‑space distributional similarity | Distribution Alignment | 0 → ∞ (↓) | Kernel choice affects results, depends on a surrogate model |
| CLIPScore | Image‑text alignment | Prompt Alignment | 0 → 100 (↑) | Insensitive to image quality, depends on a surrogate model |
| PSNR | Pixel‑wise similarity | Perceptual Alignment | 0 → ∞ (↑) | Correlates poorly with human perception |
| SSIM | Structural similarity | Perceptual Alignment | –1 → 1 (↑) | Can be unstable for small input variations |
| LPIPS | Perceptual similarity | Perceptual Alignment | 0 → 1 (↓) | Depends on a surrogate model |
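As a rough, hedged illustration of the prompt‑ and perceptual‑alignment rows, the snippet below computes CLIPScore, PSNR, SSIM, and LPIPS with torchmetrics (one common open‑source implementation, not necessarily what Pruna uses internally); the tensor shapes, value ranges, and CLIP checkpoint are assumptions made for the example:

```python
# Sketch assuming `gen`/`ref` are float images in [0, 1], shape [N, 3, H, W],
# and `prompts` is a list of N strings. Requires torchmetrics, torchvision,
# and transformers to be installed.
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.multimodal.clip_score import CLIPScore

gen = torch.rand(4, 3, 256, 256)      # stand-in generated images
ref = torch.rand(4, 3, 256, 256)      # stand-in reference images
prompts = ["a photo of a red bicycle"] * 4

psnr = PeakSignalNoiseRatio(data_range=1.0)(gen, ref)                    # higher is better
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)(gen, ref)        # higher is better
lpips = LearnedPerceptualImagePatchSimilarity(normalize=True)(gen, ref)  # lower is better
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")(
    (gen * 255).to(torch.uint8), prompts                                 # higher is better
)
print(float(psnr), float(ssim), float(lpips), float(clip_score))
```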
Distribution Alignment Metrics
Distribution alignment metrics measure how closely generated images resemble real‑world data distributions, comparing both low‑ and high‑dimensional features. In pairwise mode, they compare outputs from different models to produce a single score that reflects relative image quality.


Fréchet Inception Distance (FID)
FID (introduced here) is one of the most popular metrics for evaluating how realistic AI‑generated images are. It works by comparing the feature distribution of reference images (e.g., real images) to the images generated by the model.
How it works
- Pass both real and generated images through a pretrained surrogate model (usually Inception v3).
- The model converts each image into a feature embedding.
- Assume the embeddings from each set follow a Gaussian distribution.
- Measure the distance between the two Gaussians; the smaller the distance, the better.
A lower FID score indicates that the generated images are more similar to real ones, meaning better image quality.
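For example, FID can be computed with torchmetrics' `FrechetInceptionDistance` (one widely used implementation, not a claim about Pruna's internals); a minimal sketch, assuming both image sets are available as uint8 tensors of shape `[N, 3, H, W]`:

```python
# Requires torchmetrics with torch-fidelity installed. The random tensors are
# tiny stand-ins; a meaningful FID needs thousands of images per set.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception v3 pool features

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)         # accumulate reference statistics
fid.update(generated_images, real=False)   # accumulate generated statistics
print(float(fid.compute()))                # lower is better
```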
Mathematical formulation
$$ \text{FID} = \|\mu_r - \mu_g\|^2 + \operatorname{Tr}\bigl(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\bigr) $$
where
- $(\mu_r, \Sigma_r)$ are the mean and covariance of real‑image features,
- $(\mu_g, \Sigma_g)$ are the mean and covariance of generated‑image features,
- $\operatorname{Tr}(\cdot)$ denotes the trace of a matrix, and
- $(\Sigma_r \Sigma_g)^{1/2}$ is the matrix square root (geometric mean) of the covariances.
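To make the formula concrete, here is a from‑scratch sketch with NumPy and SciPy; it assumes the surrogate‑model step has already produced feature arrays `feats_r` and `feats_g` (hypothetical names), each of shape `[N, D]`:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """FID between two feature sets, following the formula above."""
    mu_r, sigma_r = feats_r.mean(axis=0), np.cov(feats_r, rowvar=False)
    mu_g, sigma_g = feats_g.mean(axis=0), np.cov(feats_g, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = np.real(covmean)  # drop tiny imaginary parts from numerical error

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```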
CLIP Maximum Mean Discrepancy (CMMD)
CMMD (introduced here) measures how close generated images are to real ones using embeddings from a pretrained CLIP model instead of Inception features.
How it works
- Pass both real and generated images through a pretrained CLIP model to obtain feature embeddings.
- No Gaussian assumption is made about the embeddings.
- Apply a kernel function (typically RBF) to compare the two distributions via the Maximum Mean Discrepancy (MMD) framework.
A lower CMMD score indicates that the feature distributions of generated images are more similar to those of real images, meaning better image quality.
Mathematical formulation
$$ \text{CMMD} = \mathbb{E}\bigl[ k(\phi(x_r), \phi(x_r')) \bigr] + \mathbb{E}\bigl[ k(\phi(x_g), \phi(x_g')) \bigr] - 2\,\mathbb{E}\bigl[ k(\phi(x_r), \phi(x_g)) \bigr] $$
where
- $\phi(\cdot)$ denotes the CLIP embedding function,
- $k(\cdot,\cdot)$ is a kernel (e.g., RBF), and
- the expectations are taken over pairs of real $(x_r, x_r')$ and generated $(x_g, x_g')$ samples.
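A minimal sketch of this estimator, assuming CLIP image embeddings have already been extracted into tensors `emb_r` and `emb_g` (hypothetical names, shapes `[N, D]` and `[M, D]`); the RBF bandwidth `sigma` is an illustrative choice, not a value taken from the CMMD paper:

```python
import torch

def rbf_mmd2(emb_r: torch.Tensor, emb_g: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
    """Biased MMD^2 estimate between two embedding sets with an RBF kernel."""
    def rbf(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)) for every pair of rows
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))

    return rbf(emb_r, emb_r).mean() + rbf(emb_g, emb_g).mean() - 2 * rbf(emb_r, emb_g).mean()
```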