RealGen — Detector-Reward Guided Photorealistic Text-to-Image Generation

Overview and positioning

RealGen is a framework for photorealistic text-to-image generation that pairs a large language model (LLM) for prompt optimization with a diffusion model for image synthesis. It targets two persistent gaps in current text-to-image systems: the photorealism gap itself, and the lack of evaluation protocols that quantify photorealism specifically rather than general human preference or instruction following. The approach combines automated scoring, reinforcement-style optimization, and prompt refinement to produce images that are harder for both detectors and humans to distinguish from real photographs.

Capabilities and primary use cases

  • Text-to-image generation: accepts text prompts, produces photorealistic images, and supports automated photorealism evaluation via the introduced benchmark and scoring mechanisms.

RealGen is positioned primarily for generating high-photorealism images from text prompts. Inputs are textual prompts (which are expanded/optimized by an LLM) and outputs are images described as having high photorealism. No explicit multi-turn editing or interactive speed/latency claims are provided.

Core method and technical approach

The central idea is to use detection models as differentiable rewards to guide post-training of the generation pipeline. Three interlocking elements form the method:

  • A detector-based reward, the Detector Reward, quantifies AI artifacts and image realism at both semantic and feature levels, and serves as the guiding signal during optimization.
  • An LLM (Qwen3-4B-Instruct) handles prompt expansion and optimization: initial user instructions are rewritten into longer, optimized prompts.
  • The image generator is a diffusion-based backbone (FLUX.1-dev) that is post-trained under the Detector Reward signal using a reinforcement-style algorithm, GRPO.

The training and refinement process is a two-stage post-training pipeline guided by the detection reward model. The overall optimization uses multiple reward functions (including detector-based and aesthetic-like scores) to nudge the generative model away from typical AI artifacts such as overly smooth skin or unnatural sheen. Prompt engineering is a formalized step: the LLM produces optimized prompts that feed into the generator.
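
The multi-reward shaping described above can be sketched as a simple weighted combination of per-image scores. The weights, score ranges, and function names here are illustrative assumptions, not the paper's actual formulation:

```python
def combined_reward(detector_score: float, aesthetic_score: float,
                    w_detector: float = 0.8, w_aesthetic: float = 0.2) -> float:
    """Weighted sum of per-image reward signals, each assumed to lie in [0, 1].

    detector_score:  how "real" the detector judges the image (1.0 = real-looking).
    aesthetic_score: an auxiliary preference-style score.
    """
    return w_detector * detector_score + w_aesthetic * aesthetic_score

# An image that fools the detector and also scores well aesthetically
# receives a high scalar reward for the optimization loop.
r = combined_reward(detector_score=0.9, aesthetic_score=0.7)
```

Weighting the detector term heavily reflects the paper's emphasis on photorealism over generic preference, but the actual balance used is not reported.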

Architecture and training regimen

The pipeline combines an LLM text stream and a diffusion image stream: the LLM is responsible for intent understanding and prompt optimization while the diffusion model synthesizes images. Notable architecture and training choices include:

  • Backbone LLM: Qwen3-4B-Instruct for the prompt optimization component.
  • Base image generator: FLUX.1-dev used as the image synthesis foundation.
  • Two-stage post-training: a detector-reward guided post-training is applied on top of the existing generator.
  • Auxiliary training stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) optimization via GRPO.
  • Main objective: explicitly stated as optimizing for user intent refinement and photorealistic image generation.

No low-level tokenization, autoencoder, or latent representation details are provided beyond the high-level LLM + diffusion pairing.
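
The high-level LLM-plus-diffusion pairing can be sketched as a two-step pipeline. Both model calls are stubbed out (the real system uses Qwen3-4B-Instruct and FLUX.1-dev); the helper names and the appended photorealism cues are invented for illustration:

```python
def expand_prompt(user_prompt: str) -> str:
    # Stand-in for the LLM prompt optimizer: rewrite the short user
    # instruction into a longer, photorealism-oriented prompt.
    cues = "natural lighting, realistic skin texture, candid framing"
    return f"{user_prompt}, {cues}"

def generate_image(optimized_prompt: str) -> dict:
    # Stand-in for the diffusion backbone; returns a placeholder record
    # instead of an actual image tensor.
    return {"prompt": optimized_prompt, "image": "<latent tensor>"}

result = generate_image(expand_prompt("a street vendor at dusk"))
```

The key structural point is that the user never writes the final prompt: the LLM stream always sits between user intent and the image stream.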

Data, curation, and training examples

The training data for detector-guided post-training and evaluation includes a curated, small-scale high-quality set and references to real-image subsets used for detector calibration:

  • Dataset scale: "1000 high-quality real-world images."
  • Data sourcing: images were sourced from the internet and free photography websites, and from the real image subset of HPD v3.
  • Dataset composition: spans seven distinct categories with an increased proportion of the "Portrait" category.

These data elements are used for calibration, Detector Reward computation, and the introduced evaluation benchmark.
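
One plausible way such a real-image set supports detector calibration is to record reference statistics of detector scores on real photographs, so generated images can be judged relative to them. This sketch, including the scores themselves, is hypothetical:

```python
import statistics

# Hypothetical detector scores on a handful of curated real images.
real_image_scores = [0.91, 0.88, 0.95, 0.87, 0.93]

# Reference statistics: what "real" looks like to the detector.
reference = {
    "mean": statistics.mean(real_image_scores),
    "std": statistics.pstdev(real_image_scores),
}

def realism_z(score: float) -> float:
    """How many standard deviations a generated image's detector score
    sits from the real-photo reference distribution (0.0 = typical real)."""
    return (score - reference["mean"]) / reference["std"]
```

A category-stratified set (the seven categories, with extra Portrait images) would keep the reference statistics from being skewed toward any one content type.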

Sampling and optimization details

The reinforcement-style training runs for approximately 230 reported RL steps, with the Detector Reward applied throughout the loop to shape generator outputs. No further sampling schedules, timestep distributions, or latency numbers are provided.
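
The group-relative normalization at the heart of GRPO can be illustrated as follows: for each prompt, a group of images is sampled, each is scored by the reward, and rewards are normalized within the group to form advantages. The reward values are made up and the policy-gradient update itself is omitted:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sample's reward against
    the mean and std of its own group (std guarded against zero)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# One illustrative step: four images sampled for the same prompt, scored
# by the Detector Reward; above-average samples get positive advantages.
group_rewards = [0.42, 0.65, 0.58, 0.31]
advantages = grpo_advantages(group_rewards)
```

Because advantages are computed within each group, GRPO needs no learned value baseline, which is part of its appeal for post-training a large diffusion backbone.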

Evaluation methodology and metrics

RealGen introduces a dedicated benchmark, RealBench, aimed at automated evaluation of photorealism. Evaluation combines detector-based scoring and pairwise preference-style protocols:

  • Detector-based realism quantification: Detector-Scoring (also written DetectorScoring).
  • Pairwise preference evaluation: Arena-Scoring (arena-style preference evaluation).
  • Additional metrics used in the broader evaluation suite include Pick-Score, HPSv2.1, HPSv3, LongCLIP, CLIP, and others. External benchmarks referenced include Forensic-chat, OmniAID, Effort, GPT 5-Prompt, VS Real, and VS Others, plus the Photo subset of HPD v2.
  • Human evaluation: RealGen is reported to achieve a win rate approaching 50% in pairwise comparisons against real images.

The evaluation protocol therefore mixes automated detector-derived measurements and Arena-style human (or simulated) preference comparisons.
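
The Arena-style pairwise protocol reduces to a win-rate computation over model-vs-model (or model-vs-real) battles. In this sketch, counting ties as half a win is an assumption; it shows why a win rate near 50% against real images indicates near-indistinguishability:

```python
def win_rate(outcomes: list[str]) -> float:
    """Fraction of pairwise battles won; a tie counts as half a win."""
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return score / len(outcomes)

# If judges compare generated vs. real images and the generated side wins
# about half the time, judges are effectively guessing at random.
wr = win_rate(["win", "loss", "tie", "win", "loss", "win"])
```

A win rate well below 0.5 would mean real photos are reliably preferred; approaching 0.5 is the best a generator can meaningfully do in this protocol.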

Quantitative results and comparative performance

RealGen reports comparative performance across a suite of contemporary text-to-image models using the Detector-Scoring arrays, Arena-Scoring pairs, and other metric vectors. Representative reported numbers (preserved exactly) include the following model-level summaries.

  Model        Detector-Scoring              Arena-Scoring   Other metrics
  FLUX-Pro     57.45, 21.55, 20.94, 50.14    18.2, n/a       23.68, 86.85, 30.79, 12.78
  SDXL         43.37, 24.44, 8.44, 23.82     9.22, n/a       23.02, 84.65, 28.44, 9.87
  Qwen-Image   57.47, 36.82, 17.1, 65.03     18.25, 47.35    21.97, 82.83, 25.5, 8.15
  Ours         70.59, 37.85, 31.71, 92.79    43.41, 74.8     23.58, 86.8, 31.87, 13.61
  Ours*        80.84, 47.2, 38.35, 96.73     50.15, 84.85    21.75, 87.69, 28.24, 11.11

  ("n/a" = value not reported.)

A broad quantitative claim included with the evaluation data states: "RealGen outperforms existing leading T2I models across multiple key photorealism metrics." Additional per-model numbers are provided in the full evaluation tables and cover many other models (Nano-Banana, SeedDream 3.0, GPT-Image-1, SD-3.5-Large, FLUX.1-dev, FLUX.1-Kontext, Echo-4o, Bagel, SRPO, FLUX-Krea, etc.) with their Detector-Scoring arrays, Arena-Scoring values, and other metric tuples.

Headline outcomes explicitly reported:

  • "RealGen achieves the highest overall win rate in model-vs-model battles."
  • Reported score pairs include: for model "Ours" Detector-Scoring 71.34 and Aesthetic Scoring 30.18; for baseline "FLUX.1-dev" Detector-Scoring 43.03 and Aesthetic Scoring 12.16.

These results reflect the combined detector-guided optimization, LLM prompt refinement, and the RealBench evaluation protocol.

Key contributions and distinguishing elements

RealGen's principal contributions and distinguishing elements are:

  • The introduction and use of a Detector Reward mechanism to quantify artifacts and guide photorealism optimization.
  • Application of the GRPO algorithm to optimize the generation pipeline in a reinforcement-style post-training stage.
  • The proposal of RealBench as an automated benchmark for photorealism evaluation, alongside an Arena-style pairwise scoring (Arena-Scoring) and detector-derived metrics (Detector-Scoring).
  • Integration of Qwen3-4B-Instruct as the LLM prompt optimizer and FLUX.1-dev as the base image generator.

These elements combine to produce the reported improvements in detector-resistance and perceived photorealism.

Open evaluation observations

Evaluation emphasizes photorealism specifically, distinguishing it from general human preference scores or instruction-following benchmarks. Detector-based scoring and Arena-style comparisons are used in tandem to quantify improvement in realism and to show comparative gains against established T2I models such as GPT-Image-1, Qwen-Image, and FLUX-Krea.

Sources

https://arxiv.org/abs/2512.00473v1