
SDXL (Stable Diffusion XL) — Model and Fine‑tuning Overview

Overview and Positioning

SDXL is a latent diffusion model developed by Stability AI that aims to improve the visual fidelity and controllability of text‑to‑image synthesis over previous Stable Diffusion releases. It targets higher‑quality, high‑resolution image generation by expanding model capacity, changing the UNet internals, and introducing new conditioning mechanisms that explicitly encode original image dimensions and cropping information. SDXL also employs a multi‑stage approach that includes a refinement model to further improve sample fidelity.

SDXL addresses several practical issues observed in prior approaches: loss of useful training data due to minimum resolution cutoffs, cropping artifacts introduced by aggressive data augmentation, and accessibility limitations caused by multi‑model inference pipelines. It emphasizes better composition, prompt adherence, and handling of multiple aspect ratios.

Core Capabilities

  • Text-to-image synthesis
  • Controllable image editing
  • Image personalization
  • Synthetic data augmentation

These capabilities are implemented by conditioning a generative UNet in latent space via textual encodings and auxiliary spatial/size signals, and by a downstream refinement stage that enhances visual detail.

Key Architectural Choices

SDXL scales the UNet backbone substantially and alters transformer placement to improve low‑level feature processing. Notable architectural choices include a UNet backbone of 2.6B parameters, two text encoders totaling 817M parameters, and a reallocation of transformer computation toward the lower UNet levels (2 and 10 transformer blocks at the lower levels, with no transformer block at the highest‑resolution feature level). The lowest UNet level, which applied 8× downsampling, was removed in the final design.
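The block‑depth reallocation above can be sketched as a small config. The per‑level transformer counts come from the text; the dict format and names are hypothetical, not the actual SDXL implementation:

```python
# Hypothetical sketch of SDXL's UNet stage layout as described in the paper:
# no transformer blocks at the highest-resolution level, then 2 and 10 blocks
# at the lower levels; the 8x-downsampling stage is removed in the final design.
SDXL_UNET_STAGES = [
    {"downsample_factor": 1, "transformer_blocks": 0},
    {"downsample_factor": 2, "transformer_blocks": 2},
    {"downsample_factor": 4, "transformer_blocks": 10},
    # No 8x stage: the lowest level was dropped.
]

def total_transformer_blocks(stages):
    """Count transformer blocks across all UNet stages."""
    return sum(s["transformer_blocks"] for s in stages)
```

This layout concentrates attention compute at lower spatial resolutions, where it is cheaper per block, which is the reallocation the paper describes.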

The image representation is in latent space, decoded with a pretrained learned autoencoder. The improved autoencoder was trained from scratch for this work and reportedly outperforms the original in evaluated reconstruction metrics. Text conditioning uses more powerful pretrained text encoders (OpenCLIP ViT-bigG and CLIP ViT-L were explored).

Conditioning, Prompting, and Control Mechanisms

SDXL introduces multiple novel conditioning schemes to control size, cropping, and composition:

  • Conditioning on original image resolution via a vector formed from Fourier feature encodings of the original height and width.
  • Crop‑conditioning and multi‑aspect training to reduce cropping artifacts and to handle varying aspect ratios.
  • Joint training of conditional and unconditional models by replacing conditional signals with a null embedding (enabling classifier‑free guidance).

These mechanisms are designed to improve prompt adherence, reduce unwanted cropping of synthesized objects, and allow explicit control over output dimensions.
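The size‑conditioning scheme above can be sketched as follows: each of the original height and width is passed through a sinusoidal (Fourier‑feature) encoding, and the two vectors are concatenated into one conditioning vector. The embedding dimension and `max_period` are assumptions for illustration, not values from the paper:

```python
import numpy as np

def fourier_features(value, dim=256, max_period=10000.0):
    """Sinusoidal (Fourier-feature) encoding of a scalar, in the style of
    timestep embeddings; the paper applies the same idea to the original
    height/width. Returns a vector of length `dim` (half sines, half cosines)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def size_conditioning(orig_height, orig_width, dim=256):
    """Concatenate the encodings of original height and width into one
    conditioning vector (a sketch of the paper's c_size; dims are assumed)."""
    return np.concatenate([
        fourier_features(orig_height, dim),
        fourier_features(orig_width, dim),
    ])

# Example: an image that was originally 512x384 before preprocessing.
c_size = size_conditioning(512, 384)  # shape (512,)
```

In the paper this vector is added to the timestep embedding, so the model can distinguish "natively large" training images from upscaled small ones.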

Training Setup and Procedures

Training followed a multi‑stage procedure. A base pretraining stage used an internal dataset and ran for 600,000 optimization steps at a resolution of 256 × 256 pixels. A finetuning stage applied multi‑aspect training, where training batches were composed of images from the same aspect‑ratio bucket to better specialize behavior across sizes.

One reported ablation, the class‑conditional ImageNet variant CIN‑512‑only, discarded all images below the 512‑pixel minimum, leaving a training set of only 70k images. Batch construction and aspect bucketing were important parts of the curriculum. The diffusion process used a discrete‑time schedule with 1000 steps during training.
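The aspect‑bucketing idea above can be sketched as: assign each image to the bucket with the nearest aspect ratio, then draw batches from within a single bucket. The specific bucket list and batching logic here are illustrative assumptions; the paper uses many (height, width) buckets near a fixed pixel budget:

```python
import random
from collections import defaultdict

# Hypothetical bucket set; the real training uses many more (h, w) pairs.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216)]

def nearest_bucket(height, width, buckets=BUCKETS):
    """Assign an image to the bucket with the closest aspect ratio."""
    ratio = width / height
    return min(buckets, key=lambda b: abs((b[1] / b[0]) - ratio))

def bucketed_batches(images, batch_size):
    """Group images by bucket so every batch shares one aspect ratio,
    mirroring the multi-aspect finetuning described above (sketch only).
    Each image is a (height, width, payload) tuple."""
    groups = defaultdict(list)
    for img in images:
        groups[nearest_bucket(img[0], img[1])].append(img)
    for bucket, group in groups.items():
        random.shuffle(group)
        for i in range(0, len(group) - batch_size + 1, batch_size):
            yield bucket, group[i:i + batch_size]
```

Keeping each batch within one bucket lets the model see non‑square images at their native aspect ratio without padding or destructive cropping.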

The method also includes a refinement model applied post‑hoc as part of a two‑stage generation process to further enhance visual fidelity.

Sampling and Inference

Sampling commonly uses a DDIM sampler with 50 steps; many evaluations generated 5k samples at 50 DDIM steps. Classifier‑free guidance is used to improve sample quality; reported guidance scales include 5 and 8.0 in different contexts. The base and refinement models together require multiple large components at inference, which affects accessibility and sampling speed.
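Classifier‑free guidance combines the conditional and unconditional noise predictions at every sampling step. A minimal sketch of that combination, with toy arrays standing in for UNet outputs:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale=5.0):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one. scale=1 recovers the plain
    conditional prediction; larger scales strengthen prompt adherence."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy predictions standing in for the two UNet forward passes per step.
eps_u = np.zeros(4)
eps_c = np.ones(4)
guided = cfg_combine(eps_u, eps_c, guidance_scale=5.0)
```

Because both a conditional and an unconditional pass are needed per step, guidance roughly doubles the UNet compute at inference, which compounds the cost noted above.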

Evaluation and Reported Results

Evaluation used a mix of automatic metrics, specialized benchmarks, and human studies. Benchmarks and metrics cited include FID, CLIP scores, IS, PSNR, SSIM, LPIPS, and rFID, as well as the PartiPrompts (P2) benchmark. Human preference studies reported that SDXL with refinement was the highest‑rated option, and that SDXL was favored 54.9% of the time over Midjourney v5.1.

Headline quantitative results reported include:

  • CIN-512-only: FID‑5k 43.84, IS‑5k 110.64
  • CIN-nocond: FID‑5k 39.76, IS‑5k 211.5
  • CIN-size-cond: FID‑5k 36.53, IS‑5k 215.34

Additional reconstruction metrics reported for VAE variants:

  • SDXL‑VAE: PSNR 24.7, SSIM 0.73, LPIPS 0.88, rFID 4.4
  • SD‑VAE 1.x: PSNR 23.4, SSIM 0.69, LPIPS 0.96, rFID 5.0
  • SD‑VAE 2.x: PSNR 24.5, SSIM 0.71, LPIPS 0.92, rFID 4.7

SDXL was reported to outperform the previous Stable Diffusion 1.5 and 2.1 models across multiple metrics and user preference studies, and to compare favorably with other contemporary systems such as Midjourney v5.1, DeepFloyd IF, DALLE‑2, and Bing Image Creator in specific evaluations.

Limitations and Failure Modes

SDXL improves many aspects of image synthesis, but important limitations remain:

  • Training pipelines require a minimal image size, which can force exclusion of useful data.
  • The approach requires two large models (generation and refinement), negatively impacting accessibility and inference speed.
  • Difficulties remain with complex prompts that require detailed spatial arrangements; hands are commonly reported as imperfect, and concept bleeding can occur when multiple objects are requested.
  • Challenges persist in rendering long, legible text and achieving perfect photorealism.
  • FID scores for SDXL are reported to be worse than those of some previous models, even though human raters prefer its outputs, suggesting FID correlates poorly with perceived quality in this setting.
  • Synthesized objects can be cropped due to random cropping during training, which can lead to visible artifacts or missing object parts.

Future directions mentioned include investigating a single‑stage model to improve quality and speed, exploring byte‑level tokenizers to improve text synthesis and character rendering, decreasing inference cost and increasing sampling speed, and further scaling and training techniques to capture fine‑grained details.

Safety Considerations

One risk explicitly mentioned is the potential for social and racial biases introduced by large‑scale training datasets. No specific mitigations or content filter lists were provided in the reported information.

Short Technical Summary of the Method

At a high level, SDXL is a two‑stage latent diffusion approach with a substantially enlarged UNet backbone and heterogeneous transformer placement. It conditions the diffusion model on richer signals (text + Fourier‑encoded height/width, crop cues), trains conditional and unconditional branches to enable classifier‑free guidance, and decodes latents with an improved learned autoencoder. A refinement model is applied to produced samples to boost visual fidelity. Training used a discrete diffusion schedule with 1000 steps and included a multi‑aspect finetuning stage after extensive pretraining.
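The discrete 1000‑step schedule mentioned above follows the standard DDPM formulation, where a noisy latent is drawn as x_t ~ N(sqrt(alpha_bar_t) x_0, (1 − alpha_bar_t) I). A minimal sketch with a linear beta schedule; the schedule endpoints are the common DDPM defaults and are assumptions here, since the paper's summary does not restate them:

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """A standard linear discrete-time noise schedule with T=1000 steps,
    as in the original DDPM setup (endpoints assumed, not from the paper)."""
    return np.linspace(beta_start, beta_end, T)

def noisy_latent(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

betas = linear_beta_schedule()
rng = np.random.default_rng(0)
x0 = np.ones((4, 4))
xt = noisy_latent(x0, t=999, betas=betas, rng=rng)  # near-pure noise at the final step
```

At t = 999 the cumulative alpha_bar is tiny, so x_t is essentially Gaussian noise; sampling reverses this chain (here with 50 DDIM steps) from noise back to a clean latent.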

Sources

https://arxiv.org/abs/2307.01952v1