DALL·E 3 — System Overview, Safety-Guided Image Generation, and Evaluation
Overview
DALL·E 3 is an artificial intelligence system developed by OpenAI that takes a text prompt as input and generates a new image as output. The system emphasizes improved caption fidelity and image quality over earlier generations, integration with conversational interfaces for prompt generation, and expanded mitigations aimed at reducing sensitive or misleading outputs. Key methodological elements include synthetic prompts generated with GPT-4, classifier-guided safety mechanisms, and a mitigation stack that limits generation of images of public figures.
Positioning and Problem Statement
The primary problem addressed is generating images from textual descriptions while reducing harmful or unwanted outputs. This includes reducing unsolicited racy content, mitigating latent and ungrounded biases in generated images, and minimizing generation of images of public figures to lower misinformation and ethical risks. Prior approaches (for example, DALL·E 2) exhibited representation biases and handled visual synonyms poorly, which could lead to inappropriate content; the system refines prompt grounding and safety screening to address those shortcomings.
Capabilities, Inputs, and Outputs
DALL·E 3 generates images from text prompts and is designed to adhere more closely to user instructions while incorporating safety measures. Inputs are textual prompts, including prompts with specific descriptors or requests for particular artistic styles. Outputs are generated images; the system is reported to produce less racy content and can resemble particular artistic styles when prompted, subject to style-related restrictions (for example, refusal modes for living artists).
Method and Architecture Summary
The approach combines a frozen CLIP image encoder with auxiliary safety modeling and classifier-based conditioning to steer generation away from undesired content. The core technical ideas are: classifiers that filter racy content in generated images, classifier guidance during sampling to both mitigate racy outputs and improve overall image quality, and transformation of ungrounded prompts to improve grounding during generation. The documented architecture choices highlight the CLIP encoder backbone and the tuning of system prompts to balance performance against complexity and latency. A dedicated mitigation stack prevents generation of images of public figures.
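To make the classifier-guidance idea concrete, the sketch below shows the mechanism on a toy 1-D "image": a linear safety classifier supplies a gradient of log p(safe | x), and each sampling step is nudged along that gradient. Everything here (the denoiser, the classifier, the guidance scale) is an illustrative stand-in, not the actual DALL·E 3 implementation.

```python
import math
import random

def grad_log_p_safe(x, w=2.0):
    """Gradient of log p(safe | x) for a toy linear safety classifier.

    The toy classifier scores logit(safe) = -w * mean(x), i.e. brighter
    toy "pixels" count as racier, so the gradient pushes values down.
    """
    n = len(x)
    p_racy = 1.0 / (1.0 + math.exp(-w * sum(x) / n))
    g = -(w / n) * p_racy  # d/dx_i of log sigmoid(logit)
    return [g] * n

def classifier_guided_step(x, scale=5.0, rng=None):
    """One toy denoising step nudged by the classifier gradient.

    scale=0 reproduces the unguided sampler; larger values steer the
    sample harder toward the classifier's "safe" region.
    """
    rng = rng or random.Random(0)
    grad = grad_log_p_safe(x)
    return [0.9 * xi + 0.05 * rng.gauss(0, 1) + scale * gi
            for xi, gi in zip(x, grad)]
```

Running several guided steps drives the toy sample's mean below that of the unguided sampler, which is the steering effect the source attributes to classifier guidance.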
Training Objectives and Supervision
The principal training objective reported is to detect and prevent racy content in generated images. Models are trained to predict a safety score for generated images, using a combination of frozen encoders and auxiliary safety predictors. Classifier guidance is applied at inference to steer outputs according to predicted safety scores.
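The frozen-encoder-plus-auxiliary-predictor setup can be sketched as a fixed feature extractor with a small trainable logistic head that outputs a safety score. The encoder, features, and hyperparameters below are illustrative assumptions; only the head is trained, matching the frozen-backbone pattern described above.

```python
import math

def frozen_encoder(pixels):
    """Stand-in for a frozen image encoder (e.g. CLIP): fixed features
    that are never updated during safety-head training."""
    m = sum(pixels) / len(pixels)
    v = sum((p - m) ** 2 for p in pixels) / len(pixels)
    return [m, v, max(pixels)]

def train_safety_head(images, labels, lr=0.5, epochs=200):
    """Train a logistic-regression safety head on frozen features."""
    feats = [frozen_encoder(x) for x in images]
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of log loss w.r.t. the logit z
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def safety_score(pixels, w, b):
    """Predicted probability that an image is racy (1 = racy)."""
    f = frozen_encoder(pixels)
    z = sum(wi * fi for wi, fi in zip(w, f)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

At inference, the resulting score is what a guidance or filtering stage would consume.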
Training Data and Curation
Training data consist of image-caption pairs drawn from a combination of publicly available and licensed sources. The explicit dataset counts given are "100K non-racy images" and "100K racy images". Initial labeling used a text-based moderation API, and negative samples were created by modifying non-racy images to form adversarial or edge-case examples. Synthetic prompts were generated with GPT-4 to diversify image requests.
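The curation flow described above (text-based moderation labeling plus modified non-racy images as hard negatives) can be sketched as follows. The `moderate` callback and the `perturb` edit are hypothetical stand-ins for the moderation API and the image modifications the source mentions.

```python
def perturb(image):
    """Toy edit (brightness shift) used to mint edge-case negatives from
    non-racy images, so a classifier cannot lean on superficial statistics."""
    return [min(1.0, p * 1.2) for p in image]

def curate(pairs, moderate):
    """Split image-caption pairs by a text-based moderation callback
    (a stand-in for the moderation API), then derive adversarial
    negatives by modifying the non-racy images."""
    racy, non_racy = [], []
    for image, caption in pairs:
        (racy if moderate(caption) else non_racy).append((image, caption))
    hard_negatives = [(perturb(img), cap) for img, cap in non_racy]
    return racy, non_racy, hard_negatives
```

Note the labels come from captions, not pixels, which is exactly why the source later flags text-based moderation as a source of labeling inaccuracy.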
Sampling, Inference, and Guidance
Inference employs classifier guidance as the principal mechanism for mitigating racy content and improving output fidelity; it is described both as a safety intervention and as a quality-improvement method. System prompts are tuned to manage the trade-off between performance and latency, and additional output-level classification and blocklist checks are applied as part of the output filtering pipeline.
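The output-level portion of that pipeline amounts to combining a classifier score with blocklist checks. The function below is a minimal sketch under assumed names and an assumed threshold; the deployed checks and values are not specified in the source.

```python
def passes_output_filters(prompt, image_score, blocklist, threshold=0.5):
    """Output-stage screening: reject if the prompt matches a blocklist
    term or the image classifier's racy score exceeds the threshold.
    (Names and the 0.5 threshold are illustrative, not deployed values.)"""
    text = prompt.lower()
    if any(term in text for term in blocklist):
        return False
    return image_score < threshold
```

Both checks must pass before an image is returned; either one alone suffices to block.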
Evaluation and Reported Results
Evaluation included the hard64 benchmark and task evaluation focused on racy content detection. Metrics cited include AUC (area under the curve) and aggregate accuracy on sub-tasks. Headline quantitative results span several model variants on a benchmark named "eval1" and other evaluation configurations; selected reported results are reproduced verbatim:
- Scores on "eval1" by model variant:
    Baseline                      88.9
    Hyper-param tuning 2          92.5
    Above + clean by cog-api      95.7
    Above + more cleaned data     95.7
    Above + cut-paste data        95.6
- Paired-format scores (two metrics per variant, as reported):
    Baseline                      88.9/22.3
    Above + 3 crops in inference  88.9/22.3
    Hyper-param tuning            87.6/16.9
    Above + clean by cog-api      87.4/9.6
    Above + more cleaned data     88.2/10.6
    Above + cut-paste data        88.1/10.4
    Above + 3 crops in inference  88.1/10.4
- Prompt transformation and system-prompt tuning comparison:
    Untuned, no secondary prompt transformation   60.8
    Untuned, GPT-4                                71.6
    Untuned, fine-tuned GPT-3.5                   71.1
    Tuned 1, no secondary prompt transformation   64
    Tuned, GPT-4                                  75.3
    Tuned, fine-tuned GPT-3.5                     77.1
- Public figure generation claims: "With DALL·E 3-early, 2.5% of generated images contained public figures." and "With DALL·E 3-launch, none of the generated images contained public figures."
Evaluation protocols referenced aggregate accuracy and AUC metrics; specific human-evaluation summaries and full benchmark procedures are not reproduced beyond the reported scores above.
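Since AUC is a headline metric here, the rank-statistic definition is worth making explicit: AUC is the probability that a randomly drawn positive (racy) example scores higher than a randomly drawn negative, with ties counted as half. A minimal computation:

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive example outscores a
    negative one, counting ties as 0.5 (rank-statistic definition)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfectly separating classifier scores 1.0, a constant one 0.5, which is why AUC complements aggregate accuracy on imbalanced safety tasks.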
Safety Risks and Mitigations
The system explicitly addresses a range of risks: explicit content (graphic sexual and violent imagery), images of hate symbols, biological-, chemical-, and weapons-related risks, mis- and disinformation (including misleading images of public figures), unsolicited racy imagery, societal risks related to bias and representation, objectification and sexualization tendencies, and potential misuse for misinformation.
Primary mitigations and filters are applied across both input and output stages:
- ChatGPT refusals and prompt input classifiers
- Blocklists and model-level interventions
- Prompt transformations and secondary prompt processing
- Image output classifiers and expanded output blocklists
- Refusal to generate images in the style of living artists, backed by a blocklist of living artists' names
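The layered structure above can be sketched as an ordered pipeline in which each mitigation either refuses the request or passes a (possibly rewritten) prompt onward. The layer names and the rewrite below are hypothetical illustrations of the pattern, not the deployed mitigations.

```python
def apply_mitigations(prompt, layers):
    """Run input-side mitigation layers in order; any layer may refuse
    (return None) or rewrite the prompt before generation proceeds."""
    for layer in layers:
        prompt = layer(prompt)
        if prompt is None:
            return None  # refused by this layer
    return prompt

def blocklist_layer(terms):
    """Refuse prompts containing any blocklisted term."""
    def layer(prompt):
        return None if any(t in prompt.lower() for t in terms) else prompt
    return layer

def grounding_transform(prompt):
    """Toy prompt transformation: replace an underspecified subject with
    a grounded description (an illustrative rewrite, not the real one)."""
    return prompt.replace("someone", "a person in a park")
```

The ordering matters: refusal-style checks run before transformations, so a blocked prompt is never rewritten into something that slips through.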
Limitations and Directions for Future Work
Known limitations include:
- susceptibility to generating harmful outputs in edge cases
- labeling inaccuracies in training data stemming from reliance on text-based moderation
- bias and potential reinforcement of stereotypes in generative outputs
- inaccuracies in generating scientifically relevant images
- potential copyright or trademark considerations for generated images
Reported future work includes:
- refining user customization for interactions with DALL·E 3
- developing monitoring methods to flag photorealistic imagery for review
- creating provenance methods to detect images generated by DALL·E 3
- exploring partnerships between content creation and dissemination platforms
- addressing alignment between image generation models and human value systems
Notable Components and Claims
Key components emphasized are the frozen CLIP image encoder backbone, classifier guidance for inference-time steering, and synthetic prompt generation with GPT-4. Reported dataset scale figures and evaluation scores are preserved as provided, along with the claim that the launch configuration eliminated generation of public figures in the reported samples.
Sources
DALL·E 3 System Card