Imagen (text-to-image diffusion model) — Photorealistic generation with deep language understanding
Google Research, Brain Team
Overview
Imagen is a text-to-image diffusion model that combines large transformer language models with diffusion-based image synthesis to produce highly photorealistic images and improved image-text alignment. The approach pairs a T5-XXL text encoder with a cascade of U-Net diffusion models (including an Efficient U-Net variant for super-resolution), and introduces sampling and training techniques such as classifier-free guidance and dynamic thresholding to improve fidelity and alignment. Imagen reports a zero-shot COCO FID-30K of 7.27 and a suite of human-evaluation results showing preference over several prior systems.
Key contributions
- Introduced Imagen, a text-to-image diffusion model that uses large pretrained language models for text encoding.
- Achieved a new reported state-of-the-art COCO FID of 7.27 (zero-shot FID-30K).
- Introduced DrawBench, a comprehensive benchmark with 11 prompt categories for evaluating visual reasoning and social biases.
- Demonstrated that larger text encoders (T5-XXL) improve image-text alignment and image fidelity relative to smaller text encoders.
- Introduced dynamic thresholding during sampling to manage pixel saturation and enable much higher guidance weights.
- Developed an Efficient U-Net variant that is reported to use less memory, converge faster, and sample faster (a claimed 2-3x speedup in both training steps/second and sampling).
- Employed a pipeline of cascaded diffusion models for high-fidelity image generation and explicit cross-attention in super-resolution stages.
- Applied classifier-free guidance to improve image-text alignment.
Positioning and the problem addressed
Imagen targets high-fidelity text-to-image synthesis while improving alignment between textual descriptions and generated imagery. The design addresses several perceived insufficiencies in earlier approaches: many prior models trained only on image-text pairs without leveraging text-only pretraining of large language models, standard guidance regimes caused a train–test mismatch at high guidance weights, and commonly used evaluation sets such as COCO provide a limited spectrum of prompts. DrawBench was introduced to evaluate a broader set of prompts and to probe model behavior across categories including rare words, positional descriptions, counting, and social content.
Method summary
Core idea: combine transformer-based language understanding with diffusion image synthesis. Textual inputs are encoded with a large pretrained language model and conditioned into a cascading diffusion U-Net pipeline. Classifier-free guidance is used to steer samples toward better alignment; dynamic thresholding and other sampling strategies are used to mitigate pixel saturation and enable higher guidance weights.
Conditioning and fusion: conditioning is performed on text embedding sequences and diffusion timestep embeddings. Cross-attention layers allow information flow from the text embeddings into the image diffusion U-Nets at multiple resolutions. Super-resolution models retain explicit cross-attention layers to maintain alignment at higher resolutions.
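As a rough sketch of how cross-attention lets text embeddings condition the image pathway, the following single-head NumPy example is illustrative only; the shapes, weight names, and token counts are invented for the example and are not taken from the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img, txt, Wq, Wk, Wv):
    """Single-head cross-attention: image activations form the queries,
    and text embeddings supply keys/values, so each spatial position
    can pull in caption information."""
    q = img @ Wq                                    # (n_img, d)
    k = txt @ Wk                                    # (n_txt, d)
    v = txt @ Wv                                    # (n_txt, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_img, n_txt)
    return attn @ v                                 # (n_img, d)

rng = np.random.default_rng(0)
d = 8
img = rng.standard_normal((16, d))   # 16 flattened "pixel" tokens
txt = rng.standard_normal((4, d))    # 4 text-embedding tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(img, txt, Wq, Wk, Wv)
print(out.shape)
```

In the real models this happens at multiple U-Net resolutions with multi-head attention; the sketch only shows the information flow from text tokens into image activations.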
Sampling and sampling techniques: sampling methods include classifier-free guidance, static thresholding, dynamic thresholding, a discrete-time ancestral sampler, and DDIM. Dynamic thresholding is emphasized as an enabler for higher guidance weights with improved photorealism and image-text alignment.
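The classifier-free guidance rule used here follows the standard formulation: the model produces both a conditional and an unconditional noise prediction, and the guided prediction extrapolates from the unconditional one toward the conditional one. A minimal NumPy sketch (not the paper's code):

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance combination:
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    w = 1 recovers plain conditional sampling; w > 1 pushes samples
    more strongly toward the text condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# With w = 1 the guided prediction equals the conditional one.
eps_c = np.array([0.5, -0.2])
eps_u = np.array([0.1, 0.3])
guided = classifier_free_guidance(eps_c, eps_u, 1.0)
print(guided)
```

Large values of w amplify the conditional signal, which is what makes the thresholding strategies below necessary: the extrapolated prediction can leave the valid pixel range.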
Architecture
Backbone and components:
- Text encoder: T5-XXL (reported as 4.6B parameters).
- Image models: a U-Net backbone for base diffusion, and an Efficient U-Net variant used for super-resolution stages (e.g., 64 × 64 → 256 × 256; other cascaded stages up to 1024 × 1024 are described in the method).
- The U-Net contains downsampling blocks and upsampling blocks and uses ResNetBlock, DBlock, and UBlock components.
Block configuration (reported details): the architecture includes blocks with channel sizes 128, 256, 512, and 1024. Each block is reported with strides [2, 2] and kernel_size [3, 3]. The number of ResNet blocks per stage is reported as 2, 4, 8, and 8 respectively. The highest-resolution block is reported with self_attention set to false, text_cross_attention set to true, and num_attention_heads set to 8.
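The reported settings above can be collected into an illustrative configuration dictionary; the field names here are hypothetical and not taken from any released codebase:

```python
# Hypothetical config mirroring the reported block settings;
# field names are invented for illustration.
unet_config = {
    "channels": [128, 256, 512, 1024],   # per-stage channel sizes
    "num_res_blocks": [2, 4, 8, 8],      # ResNet blocks per stage
    "strides": [2, 2],                   # reported per-block stride
    "kernel_size": [3, 3],               # reported per-block kernel
    "highest_res_block": {
        "self_attention": False,
        "text_cross_attention": True,
        "num_attention_heads": 8,
    },
}
print(unet_config["channels"])
```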
Notable architectural choices and efficiency tricks:
- Memory-efficiency improvements in Efficient U-Net include shifting model parameters to low-resolution blocks, scaling skip connections by 1/√2, and reversing the order of downsampling/upsampling operations.
- Efficient U-Net is claimed to converge significantly faster and to use less memory while producing better sample quality and faster inference.
- Classifier-free guidance is used as an efficiency/quality trade-off during sampling.
- Additional reported hyperparameter choices: feature_pooling_type set to attention, dropout set to 0.0, and use_scale_shift_norm set to true.
- Scaling U-Net and text encoder sizes is reported to improve the fidelity/alignment trade-off curves.
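The 1/√2 skip-connection scaling has a simple variance rationale: summing two roughly independent unit-variance activations doubles the variance, and dividing by √2 restores it. A small NumPy check of that rationale (illustrative only, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)     # main-path activations
skip = rng.standard_normal(100_000)  # skip-path activations

merged_plain = x + skip                     # variance roughly doubles
merged_scaled = (x + skip) / np.sqrt(2.0)   # variance stays near 1

print(round(merged_plain.var(), 2), round(merged_scaled.var(), 2))
```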
Training data and curation
Training example format: image paired with English alt-text captions.
Scale and composition: training data is reported at approximately 460M image-text pairs, including approximately 400M image-text pairs from LAION-400M. The training data was filtered to remove noise and undesirable content. It is explicitly reported that LAION-400M contains inappropriate content including pornographic imagery and racist slurs.
Curation and filtering: filtering was applied to remove undesirable content; no additional procedural details are reported here.
Training setup and systems
Reported compute and distribution: the base 64 × 64 model was trained on 256 TPU-v4 chips, and super-resolution models were trained on 128 TPU-v4 chips.
Optimization: Adafactor was used for the base model, while Adam was used for super-resolution models.
Training steps: a typical training scale is reported as 2.5M training steps.
Sampling and inference
Guidance and weights: classifier-free guidance is central to steering samples; reported guidance weights include 1.35 for the 64 × 64 model and 8.0 for the super-resolution model. Larger guidance weights are reported to improve sample quality up to values of roughly 7-10, and dynamic thresholding permits the use of much higher guidance weights. A reported sweep of guidance values includes: [1, 3, 5, 7, 8, 10, 12, 15, 18].
Thresholding and artifacts: static thresholding is reported to prevent blank images but can cause oversaturation. Dynamic thresholding is introduced to manage pixel saturation adaptively, improving photorealism and image-text alignment compared to static approaches.
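The dynamic thresholding rule described in the paper sets s to a high percentile of the predicted clean image's absolute pixel values at each step; when s exceeds 1, the prediction is clipped to [-s, s] and rescaled by s, pulling saturated pixels back toward the valid range. A NumPy sketch of that rule (the paper reports a percentile of 99.5; the function shape here is an illustration, not the paper's code):

```python
import numpy as np

def dynamic_threshold(x0_pred, p=99.5):
    """Dynamic thresholding: s = p-th percentile of |x0_pred|;
    if s > 1, clip to [-s, s] and rescale by s so pixel values
    end up in [-1, 1] without hard-clipping everything to +/-1."""
    s = np.percentile(np.abs(x0_pred), p)
    s = max(s, 1.0)  # no-op when the prediction is already in range
    return np.clip(x0_pred, -s, s) / s

x = np.array([-3.0, -0.5, 0.2, 4.0])
y = dynamic_threshold(x, p=100)   # s = 4: clip, then divide by 4
print(y)                          # values now lie in [-1, 1]
```

Static thresholding, by contrast, clips every pixel to [-1, 1] directly, which preserves the range but saturates high-magnitude pixels; the adaptive rescaling above is what the paper credits for photorealism at large guidance weights.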
Samplers and variants: sampling methods reported include a discrete-time ancestral sampler and DDIM, in addition to classifier-free guidance with static and dynamic thresholding strategies.
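For reference, one deterministic DDIM update (η = 0) can be sketched as follows; this is the standard formulation with alpha denoting the cumulative noise-schedule product, a sketch rather than the paper's implementation:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev):
    """One deterministic DDIM update (eta = 0): recover the predicted
    clean image x0 from the noise prediction, then re-noise it to the
    previous timestep's noise level."""
    x0 = (x_t - np.sqrt(1 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    return np.sqrt(alpha_prev) * x0 + np.sqrt(1 - alpha_prev) * eps_pred

# Sanity check: with alpha_prev == alpha_t the update is the identity.
rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
eps = rng.standard_normal(4)
print(np.allclose(ddim_step(x_t, eps, 0.5, 0.5), x_t))
```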
Efficiency claims: the Efficient U-Net variant is reported to be 2-3x faster in both training steps/second and sampling, while using less memory and converging faster.
Capabilities and use
Supported tasks: text-to-image synthesis, image editing, and image-to-image super-resolution (notably 64 × 64 → 256 × 256), plus image-quality and image-text alignment evaluation.
Inputs accepted: text prompts (or precomputed text embeddings) and, for the super-resolution stages, low-resolution images.
Outputs produced: photorealistic, high-resolution images.
Editing capabilities: the pipeline extends image-editing capabilities and supports text-conditioned image manipulation via diffusion-based super-resolution and explicit cross-attention.
Interactivity and throughput: Efficient U-Net is reported to improve runtime throughput and sampling speed as noted above; no latency numbers are provided.
Evaluation
Benchmarks and protocols: DrawBench was introduced as a dedicated evaluation benchmark with 11 categories of prompts to probe visual reasoning skills and social biases. Standard benchmarks used include MS-COCO (COCO), COCO validation set, FID-10K/FID-30K, and CLIP-based metrics. Evaluation combined automatic metrics (FID and CLIP score) with extensive human ratings.
Metrics and human evaluation:
- Automatic metrics reported include FID and CLIP score for image-text alignment; a reported zero-shot FID-30K of 7.27 on COCO.
- Human evaluation protocol: 73 ratings per image for image quality and 51 ratings per image for image-text alignment. For DrawBench, 25 raters were used per category, totaling 275 raters across the 11 categories.
- Human preference results: Imagen is reported to be preferred over DALL·E 2, GLIDE, and Latent Diffusion in image-text alignment and image fidelity. Reported preference rates include 39.2% for photorealism overall and 43.6% when people are removed from reference data. Imagen is reported to outperform GLIDE on 8 out of 11 categories on image-text alignment, and to be preferred over DALL·E 2 in 7 out of 11 categories for text alignment in certain comparisons.
Comparisons and headline results:
- Imagen is reported to outperform GLIDE (reported FID 12.4) and DALL·E 2 (reported FID 10.4).
- Headline quantitative claim: Imagen achieved a zero-shot FID-30K of 7.27 on COCO.
- Raters are reported to prefer generations produced with the T5-XXL encoder over CLIP-based text encoders on DrawBench.
- In one reported comparison, Imagen is preferred over DALL·E 2 for sample fidelity in all 11 DrawBench categories.
Limitations and open questions
- High guidance weights can lead to unnatural images if not managed; naive use of large guidance weights often produces poor results without dynamic thresholding or other controls.
- Training data limitations are acknowledged, including the reliance on large web-scraped datasets and problematic content within LAION-400M.
- The model may drop modes of the data distribution.
- Serious limitations are reported when generating images depicting people; behaviors include encoding social biases and stereotypes.
- Smaller parameter configurations (reported 300M parameter models) significantly underperformed on DrawBench relative to larger models.
- Future directions mentioned include exploring even bigger language models as text encoders and expanding benchmark evaluations for social and cultural bias.
Safety considerations
Risks: generative image models can be leveraged for malicious purposes, including harassment and misinformation. The model report explicitly notes the risk that generative methods can be repurposed for harmful applications.
Mitigation: training data was filtered to remove undesirable content. Reported blocked or sensitive content categories include pornographic imagery and toxic language. DrawBench was created in part to evaluate social biases and failure modes.
Summary of notable numerical claims and settings
- COCO zero-shot FID-30K: 7.27.
- Reported training scale: approximately 460M image-text pairs; approximately 400M from LAION-400M.
- Text encoder: T5-XXL (4.6B parameters).
- Training steps: 2.5M training steps (typical).
- Compute: 256 TPU-v4 chips for base 64 × 64 model; 128 TPU-v4 chips for super-resolution models.
- Guidance weights: 1.35 for 64 × 64 base; 8.0 for super-resolution; sweep includes [1, 3, 5, 7, 8, 10, 12, 15, 18].
- Human evaluation: 73 ratings per image for image quality, 51 ratings per image for alignment, 25 raters per category, totaling 275 raters.
- Preference rates: 39.2% preference rate for photorealism; 43.6% when people are removed from reference data.
- Efficient U-Net speed claims: 2-3x faster in both training steps/second and sampling.