GPT-IMAGE-EDIT-1.5M: Large-Scale Instruction-Based Image-Editing Corpus and Fine-Tuning Approach
Overview
GPT-IMAGE-EDIT-1.5M is a large-scale, high-quality corpus of image-editing triplets designed to support open-source research and model training for instruction-guided image editing. The collection centers on triplets of (instruction, source image, edited image) and was constructed by unifying and refining several existing datasets using modern generative and refinement tools. The corpus emphasizes high visual quality, alignment between textual instructions and edited outputs, and coverage of diverse editing operations to bridge a data gap in open-source image-editing research.
Dataset composition and primary tasks
The dataset combines material from OmniEdit, HQ-Edit, and UltraEdit into a unified corpus, GPT-IMAGE-EDIT-1.5M; associated evaluations are reported on benchmarks such as ImgEdit-Full, GEdit-EN-full, Complex-Edit, and OmniContext, as well as on subsets of the source datasets. The primary modalities are image and text (multimodal), and the principal tasks enabled by the corpus are instruction-based image editing and the training and evaluation of models that must edit images according to free-form textual instructions.
Key dataset characteristics include triplets of instruction, source image, and edited image in consistent metadata format; high-quality generated images with fixed aspect ratios; and metrics and per-edit-type numeric scores reported in benchmark tables (for example BG Change, Color Alt., Mat. Mod., Motion, Portrait, Style, Add, Remove, Replace, Text, Tone, Avg).
Data generation and processing pipeline
Data generation and refinement relied heavily on automated model-assisted procedures. The pipeline used GPT-4o for textual instruction regeneration and as a data-generation/refinement tool, and the gpt-image-1 API to (re)generate high-quality images. Important processing steps and design choices:
- Images were generated in high-quality mode and constrained to fixed aspect ratios: 1:1 (1024×1024), 3:2 (1536×1024), and 2:3 (1024×1536).
- Source images with differing aspect ratios were padded to the nearest supported generation ratio prior to synthesis, and the padding was cropped after generation. Outputs were then resized back to the original resolution to preserve pixel density; for example, a 512 × 512 source is padded and upscaled to 1024 × 1024 for generation, and the generated output is downsampled back to 512 × 512 using bicubic interpolation.
- Automatic quality filtering was applied to reject outputs with artifacts or residual padding. A strict quality filter rejects any sample if more than 0.5% of its border is uniform padding.
- Synthetic-data strategies included output regeneration (regenerating edited images), instruction regeneration (rewriting instructions with GPT-4o), and full pair synthesis (synthesizing new reference inputs from textual instructions and applying the same edit). The HQ-Edit Generate Split synthesizes new reference input images from textual instructions, with aspect ratios chosen at random from the three supported ratios.
- Qwen-VL-7B embeddings were used to improve condition-image alignment during curation.
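The pad-and-crop geometry and the 0.5% border-padding filter described above can be sketched as follows. This is a minimal illustration, not the released implementation: the helper names and the padding value (0, i.e. black) are assumptions.

```python
# Sketch of the pad-and-crop geometry and border-padding filter described
# above. Helper names and the padding value are illustrative assumptions.

SUPPORTED = [(1024, 1024), (1536, 1024), (1024, 1536)]  # 1:1, 3:2, 2:3

def nearest_size(w, h):
    """Supported generation size whose aspect ratio is closest to w/h."""
    return min(SUPPORTED, key=lambda s: abs(s[0] / s[1] - w / h))

def pad_geometry(w, h):
    """Canvas size, centred offset, and inner size used to pad a (w, h)
    source into its nearest supported canvas before generation; the same
    offset/inner rectangle is cropped back out after generation."""
    cw, ch = nearest_size(w, h)
    scale = min(cw / w, ch / h)              # fit without distortion
    iw, ih = round(w * scale), round(h * scale)
    return (cw, ch), ((cw - iw) // 2, (ch - ih) // 2), (iw, ih)

def passes_padding_filter(pixels, pad_value=0, threshold=0.005):
    """Reject a sample if more than 0.5% of its border pixels still carry
    the padding value. `pixels` is a row-major nested list of grayscale
    values standing in for a decoded image."""
    border = (pixels[0] + pixels[-1]
              + [row[0] for row in pixels[1:-1]]
              + [row[-1] for row in pixels[1:-1]])
    return sum(1 for p in border if p == pad_value) / len(border) <= threshold
```

For instance, an 800 × 600 source maps to the 3:2 canvas (1536 × 1024), leaving roughly 85-pixel padding bars on the left and right that are cropped away after generation.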
Ablation studies reported in conjunction with the release validate a multi-step data generation strategy—specifically output regeneration and instruction rewriting—as providing tangible benefits to downstream performance.
Key contributions
- Unification of OmniEdit, HQ-Edit, and UltraEdit into a single large corpus of over 1.5 million triplets.
- Regeneration and augmentation of image-instruction pairs using GPT-4o and gpt-image-1 to improve visual quality and alignment.
- Application of fixed-aspect-ratio generation, pad-and-crop alignment, and strict automatic quality filtering (border padding threshold of 0.5%).
- Inclusion of synthetic-data strategies: output regeneration, instruction regeneration (e.g., 10% of OmniEdit rewritten by GPT-4o), and full pair synthesis for HQ-Edit Generate Split.
- Public release of the dataset and associated fine-tuned models to support open-source research.
Data specification and fields
Each sample is a triplet of (instruction, source_image, edited_image), with merged metadata provided in a consistent format across the source datasets. Core fields and their formats include:
- instruction: text describing the desired edit.
- source_image: image file representing the input to be edited.
- edited_image: image file representing the resulting edited output.
Benchmark and evaluation fields reported as numeric scores include BG Change, Color Alt., Mat. Mod., Motion, Portrait, Style, Add, Remove, Replace, Text, Tone, and Avg (the mean of the per-edit-type scores). The dataset supports benchmarking across multiple axes: Instruction Following (IF), Identity Preservation (IP), and Perceptual Quality (PQ).
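Given the field list above, a sample can be represented as one JSON object per line (JSONL). The record below is a hypothetical illustration: the path values are placeholders, and the released files may use different naming.

```python
import json

# Hypothetical JSONL record following the triplet schema above;
# the path values are placeholders, not real files from the release.
record = {
    "instruction": "Replace the red car with a blue bicycle.",
    "source_image": "images/000123_source.png",
    "edited_image": "images/000123_edited.png",
}

line = json.dumps(record)       # one sample per line in a .jsonl file
loaded = json.loads(line)       # round-trips to the same triplet
```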
Scale, splits, and coverage
Reported scale and subset counts:
- The consolidated corpus contains over 1.5 million samples.
- Smaller named subsets and their reported sizes: omniedit100k-base (100000), omniedit100k-gpt (100000), omniedit100k-gpt-rewrite (100000), hqedit100k-base (100000), hqedit100k-output-regen (100000), hqedit100k-pair-regen (100000).
The dataset focuses on the image-editing domain and covers instruction-guided editing and multimodal AI tasks. No language, geographic, or temporal coverage breakdowns are provided in the available specification.
Recommended uses and evaluation
Intended use cases emphasize open-source model training and research in instruction-following image editing, including training and evaluating models on free-form editing instructions using image-instruction pairs. Recommended benchmarks and evaluation protocols include GEdit-EN-full, ImgEdit-Full, and Complex-Edit. Complex-Edit adds a complexity axis (reported at level C8) alongside the evaluation axes IF (Instruction Following), IP (Identity Preservation), and PQ (Perceptual Quality).
Reported metrics and baseline performance claims include:
- FluxKontext fine-tuned on GPT-IMAGE-EDIT-1.5M achieves 7.24 on GEdit-EN-full, 3.80 on ImgEdit-Full, and 8.78 on Complex-Edit.
- Additional reported scores: OmniEdit regeneration imgedit score 3.24, OmniEdit instruction-regeneration imgedit score 3.40, HQ-Edit pair-regen GEdit-EN score 5.73, and 7.236 average on GEdit-EN in related tables.
- The corpus and fine-tuned models are reported as state-of-the-art among open-source methods and competitive with GPT-4o in certain evaluations.
Limitations and open questions
Data-quality sensitivity is highlighted as central to training outcomes, and the curation pipeline adopts strict filtering to minimize artifact- and padding-related degradation. One observed failure mode is that increasing instruction complexity without preserving subject identity can be counterproductive. Known limitations and open questions from the available materials include gaps in language and geographic coverage reporting and a lack of publicly documented de-duplication or decontamination procedures.
Access and license
The dataset and associated fine-tuned models are described as publicly available. The specific license terms are summarized as "Publicly available"; no additional license text or access procedure is provided in the available specification.