PIXARTδ — Fast, Controllable Text-to-Image via Latent Consistency and Transformer ControlNet
Overview
PIXARTδ is a text-to-image synthesis framework that integrates Latent Consistency Models (LCM) and ControlNet concepts into a Transformer-based backbone derived from prior PIXART work. The design emphasizes rapid inference, high-resolution output, and explicit condition control: reported deployments produce high-quality 1024px images while supporting ControlNet adaptations tailored to Transformer stacks.
Key highlights
- High-quality image generation at 1024px resolution.
- Generates 1024 × 1024 images in 0.5 seconds (reported on A100).
- Achieves an inference speed improvement of 7 × compared to PIXARTα.
- Supports 4-step sampling while maintaining quality.
- 8-bit inference capability enables operation within 8GB GPU memory constraints.
Positioning and intended problems solved
PIXARTδ targets two practical shortcomings in contemporary text-to-image pipelines: insufficient inference speed at high resolution, and weak or awkward control mechanisms when adapting ControlNet-style conditioning to Transformer architectures. The approach explicitly addresses Transformer idiosyncrasies—where conventional encoder/decoder connections used by UNet-based ControlNet are not directly applicable—by introducing a Transformer-native ControlNet variant. The result aims to provide faster and more controllable image generation compared to prior PIXART variants and SDXL baselines.
Architectural summary
PIXARTδ uses a Transformer backbone and operates in latent space. The core backbone contains 28 Transformer blocks. Two different ControlNet styles are described and compared:
- ControlNet-UNet (baseline style): conceptually treats the first 14 blocks as an "encoder" and the last 14 as a "decoder".
- ControlNet-Transformer (proposed): creates N trainable copies of the first N base blocks to integrate explicit conditioning within the Transformer stack. This design replaces the zero-convolution used in UNet-based ControlNet with a zero linear layer to fit Transformer parameterization.
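The zero-linear idea above can be sketched in a few lines. This is an illustrative, dependency-free stand-in (not the released implementation): `ZeroLinear`, `forward`, and the identity "blocks" in the usage test are hypothetical names, and real blocks would be Transformer layers operating on token tensors. The point it demonstrates is that a zero-initialized linear gate makes the control branch a no-op at the start of training, so the pretrained backbone's behavior is preserved.

```python
class ZeroLinear:
    """Linear layer whose weight and bias start at zero, so its output is
    all-zeros at initialisation (the Transformer analogue of the
    zero-convolution used in UNet-based ControlNet)."""

    def __init__(self, dim):
        self.weight = [[0.0] * dim for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


def forward(base_blocks, control_blocks, zero_linears, x, cond):
    """Hypothetical forward pass: trainable copies of the first
    len(control_blocks) backbone blocks process the condition, and each
    copy's output is added to the backbone stream through a zero linear."""
    c = cond
    for i, block in enumerate(base_blocks):
        x = block(x)  # frozen backbone block
        if i < len(control_blocks):
            c = control_blocks[i](c)  # trainable copied block
            x = [xi + zi for xi, zi in zip(x, zero_linears[i](c))]
    return x
```

At initialisation the gated addition contributes nothing, so the backbone output is unchanged regardless of the conditioning input.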
Notable design choices include a ControlNet architecture customized for Transformer models and adjustments to the diffusion schedule: the β_t curve is changed from scaled-linear to linear, with β_start changed from 0.00085 to 0.0001 and β_end from 0.012 to 0.02. The model is reported to support image resolutions up to 1024 × 1024.
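The reported schedule change can be made concrete with a short sketch. The endpoint values (0.0001 → 0.02 linear; 0.00085 → 0.012 scaled-linear) come from the text above; the choice of 1000 diffusion steps is an assumption here, used only to make the curves computable.

```python
def linear_betas(beta_start=0.0001, beta_end=0.02, num_steps=1000):
    """Linearly spaced betas, matching the schedule reported for PIXARTδ.
    num_steps=1000 is an assumed (conventional) value, not from the source."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """The prior scaled-linear schedule: linear in sqrt(beta), then squared."""
    s0, s1 = beta_start ** 0.5, beta_end ** 0.5
    step = (s1 - s0) / (num_steps - 1)
    return [(s0 + i * step) ** 2 for i in range(num_steps)]
```

Both functions hit their stated endpoints exactly; the scaled-linear curve front-loads smaller betas, while the linear curve grows uniformly.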
Conditioning, control, and editing
Conditioning for controlled generation frequently uses HED edge maps as the conditioning signal. The ControlNet-Transformer architecture is presented as offering fine-grained control over text-to-image diffusion outputs by allowing precise integration of the condition into early Transformer blocks. Editing capabilities emphasize controllable, condition-guided synthesis rather than unconstrained generation.
Training objectives and distillation
Training centers on a consistency distillation objective, with Latent Consistency Distillation (LCD) named explicitly as the distillation method used. Distillation is a core component to accelerate sampling (enabling 2–4 step samplers) while preserving image quality. The reported training process includes both a distillation stage and practical memory-aware training workflows enabling moderate hardware requirements.
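The shape of a latent-consistency-distillation step can be sketched as follows. This is a hedged, scalar stand-in for the real objective: `f_student` and `f_ema` represent consistency functions (online model and its EMA copy), `ode_step` stands in for one step of the teacher-driven PF-ODE solver, and squared distance stands in for whatever distance the actual objective uses. All names are illustrative, not PIXARTδ's code.

```python
def lcd_loss(f_student, f_ema, ode_step, z_next, t_next, t_cur):
    """Consistency-distillation loss between the student's prediction at
    t_{n+1} and the EMA model's prediction at the solver-estimated z_{t_n}.
    The EMA branch is treated as a constant (no gradient flows through it)."""
    z_cur = ode_step(z_next, t_next, t_cur)  # one solver step toward t_n
    pred = f_student(z_next, t_next)         # student consistency output
    target = f_ema(z_cur, t_cur)             # EMA-target consistency output
    return (pred - target) ** 2
```

The key structural point is that supervision comes from the model's own (EMA) prediction one solver step earlier, which is what lets the distilled sampler collapse to 2–4 steps.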
Training data and scale
Reported dataset figures vary by report and task framing. Two dataset statements appear in parallel:
- "120K internal image-text pairs"
- "Training set consists of 3M HED and image pairs"
Both dataset quantities are reported as used in different parts of the training or evaluation pipeline; no merging or reconciliation of these counts is provided.
Training setup and resource requirements
Resource-efficient training and distillation are emphasized. Reported infrastructure and memory characteristics: PIXARTδ is reported to train efficiently on 32GB V100 GPUs within a single day; full finetuning is said to require less than 24GB of GPU memory; and distributed training has been conducted on 16 V100 GPUs (32GB each). The model also supports 8-bit inference, reducing GPU memory requirements during inference to below 8GB.
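To illustrate where the 8-bit memory savings come from, here is a minimal symmetric per-tensor int8 quantisation sketch (not PIXARTδ's actual inference kernel, whose scheme the source does not detail): weights are stored as int8 plus one floating-point scale, roughly quartering memory relative to fp32.

```python
def quantize_int8(values):
    """Map floats to int8 range [-127, 127] with one shared symmetric scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid 0 for all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats; error is bounded by the quantisation step."""
    return [qi * scale for qi in q]
```

Real 8-bit inference stacks typically quantise per-channel and keep sensitive layers in higher precision, but the storage arithmetic is the same: one byte per weight instead of four.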
Sampling, inference, and acceleration
PIXARTδ leverages LCM-based acceleration and consistency distillation to minimize step counts and latency during inference. Reported sampling and timing claims include:
- Typical LCM-style inference uses 2–4 steps; PIXARTδ's reported results use 4 steps for generation.
- Comparative step counts: PIXARTα requires 14 steps and SDXL standard requires 25 steps in reported comparisons.
- Inference latency reports: 0.5 seconds for 1024 × 1024 images on A100, 0.8 seconds on V100, and 3.3 seconds on T4. Additional reported latencies include "1 second for image generation" in some contexts.
- Guidance uses classifier-free guidance with guidance scale ω fixed at 4.5.
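The classifier-free guidance step in the list above is a one-line combination of two noise predictions. The sketch below uses the common formulation ε_u + ω·(ε_c − ε_u) with the reported ω = 4.5; note that conventions differ across papers (some write (1+ω)ε_c − ω·ε_u, shifting ω by 1), and the source does not specify which one PIXARTδ uses. Scalars stand in for the noise tensors.

```python
def cfg(eps_uncond, eps_cond, omega=4.5):
    """Classifier-free guidance: eps_uncond + omega * (eps_cond - eps_uncond),
    elementwise. omega=4.5 is the guidance scale reported for PIXARTδ."""
    return [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

At ω = 1 the result is the conditional prediction unchanged; ω > 1 extrapolates away from the unconditional prediction, strengthening prompt adherence.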
Reported convergence/iteration figures vary across reports: an "approximately 5,000 iterations for convergence" is cited in one context, while "convergence observed at around 1,000 training steps" is cited elsewhere.
Evaluation and comparisons
Evaluation uses established metrics such as FID and CLIP for quantitative comparison. PIXARTδ is reported to be compared against PIXARTα, SDXL-LCM-LoRA, and SDXL standard. Headline evaluation claims include:
- Generation speed improvements over SDXL standard, which is reported to take up to 26.5 seconds on T4 and 3.8 seconds on A100.
- ControlNet-Transformer yielding better controllability, faster convergence, and improved overall performance than ControlNet-UNet in reported comparisons.
Efficiency and deployment considerations
Multiple engineering choices target throughput and memory efficiency: 8-bit inference capability, 4-step sampling acceleration, and distillation conducted within a 32GB GPU memory footprint. These choices are presented to enable high-resolution generation at interactive latencies and to permit larger effective batch sizes on constrained GPUs.
Limitations and future work
Future work reported includes exploring conditioning signals beyond HED, for example Canny edge maps. No explicit failure cases or safety mitigations are listed in the reported material.