Nano Banana Pro — Generative Multimodal Approach for Low-Level Vision
Overview
Nano Banana Pro (also referenced as NB Pro, NBPro, and related family names) is a generative multimodal image model evaluated on a broad set of low-level vision tasks. The approach emphasizes semantic coherence and perceptual quality, combining generative priors, optimized latent diffusion, and efficient attention mechanisms. Evaluations cover 14 distinct tasks across 40 datasets and contrast perceptual quality with traditional pixel-fidelity metrics.
Positioning and Problem Statement
Nano Banana Pro targets practical low-level vision challenges where task-specific methods or regression-based pipelines fall short. The major problem domains are image restoration, enhancement, and fusion, spanning concrete tasks such as dehazing, deraining, shadow and flare removal, denoising, defocus and motion deblurring, low-light enhancement, underwater image correction, HDR reconstruction, multi-focus fusion, and infrared-visible fusion.
Why conventional methods are often insufficient:
- Many established methods require task-specific training and pixel-level supervision, often producing over-smoothed textures and generalizing poorly when real-world degradations deviate from synthetic training data.
- Collecting large-scale, perfectly aligned paired datasets for real-world degradations (e.g., hazy vs. clean images) is practically infeasible.
- Regression-oriented and handcrafted-prior approaches emphasize pixel fidelity and can introduce visible artifacts, lack high-frequency detail, or fail to preserve semantic integrity in severe information-loss scenarios.
- Generative and general-purpose multimodal models may prioritize plausibility over strict pixel-wise fidelity, producing perceptually appealing yet quantitatively lower-scoring outputs on PSNR/SSIM-style benchmarks.
A stated contribution is the use of a rich world model and multimodal reasoning to interpret degraded regions and synthesize perceptually coherent restorations without task-specific fine-tuning.
Core Method and Architectural Notes
The approach is built around generative priors and multimodal reasoning rather than dedicated supervised restoration architectures. Key methodological themes include:
- Use of latent diffusion technology and efficient attention mechanisms to synthesize high-frequency details and plausible textures in areas of missing information (a minimal sampling sketch follows this list).
- Prioritization of semantic coherence and visual plausibility over pixel-perfect reconstruction; generative priors are leveraged to recover global structure and plausible local detail.
- Architectural elements include a hierarchical U-shaped framework and local-enhanced selective scan mechanisms. The broader family includes both CNN- and Transformer-based design patterns; notable choices emphasize generative diffusion and transformer-style components.
- Certain sub-methods apply a two-stage progressive strategy for flare elimination and detail recovery; prompting is used to guide zero-shot behavior via simple natural language instructions.
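To make the latent-diffusion theme concrete, the following is a generic DDPM-style sampling loop in latent space. This is a minimal sketch, not Nano Banana Pro's actual (unpublished) sampler; `denoiser` stands in for whatever conditioned noise-prediction network the model uses, and the schedule is a standard linear one.

```python
import torch

def sample_latent(denoiser, cond, shape=(1, 4, 64, 64), steps=50):
    """Iteratively denoise a random latent under image+text conditioning."""
    betas = torch.linspace(1e-4, 2e-2, steps)   # generic linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(shape)                      # start from pure noise in latent space
    for t in reversed(range(steps)):
        eps = denoiser(z, t, cond)              # predicted noise, conditioned on
                                                # the degraded image + text prompt
        # DDPM posterior mean: remove the predicted noise component.
        z = (z - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)  # re-inject noise
    return z  # a VAE decoder would map this latent back to pixels
```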
Conditioning relies on natural-language prompts such as: "This is a rainy image. Please remove the rain streaks and raindrops while keeping all other elements, the original color tone, lighting, and atmosphere unchanged." Image-plus-text conditioning directs restoration behavior, with the focus on zero-shot evaluation rather than supervised fine-tuning.
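As an illustration of this prompting pattern, here is a sketch of how a task instruction might be assembled and passed to the model. The `model.restore` call and the template wording are hypothetical; NB Pro's public interface is not specified in this document.

```python
from PIL import Image

# Hypothetical instruction template, following the quoted rain-removal prompt.
PROMPT_TEMPLATE = (
    "This is a {degradation} image. Please remove the {artifact} while "
    "keeping all other elements, the original color tone, lighting, and "
    "atmosphere unchanged."
)

def build_prompt(degradation: str, artifact: str) -> str:
    """Fill the template for a specific restoration task."""
    return PROMPT_TEMPLATE.format(degradation=degradation, artifact=artifact)

degraded = Image.open("rainy_street.png")            # stand-in input image
prompt = build_prompt("rainy", "rain streaks and raindrops")
# restored = model.restore(image=degraded, instruction=prompt)  # hypothetical API
```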
Capabilities, Inputs, and Outputs
The system supports a wide range of image-level tasks and accepts both image and textual instruction inputs. Representative capabilities include:
- Image restoration
- Image enhancement
- Image fusion
- Image super-resolution
- Reflection removal
Accepted inputs include simple textual prompts, low-resolution or otherwise degraded images (rainy, noisy, flare-corrupted, low-light, low-dynamic-range, or infrared/visible pairs), and task-specific natural-language instructions. Typical outputs are restored images with high perceptual quality: high-fidelity, high-resolution images; dehazed and denoised results; images with reduced reflection or flare artifacts; and fused images with improved clarity and contrast. The model also supports high-resolution synthesis, multi-image fusion, accurate text rendering within images, and semantically coherent scene manipulation.
Training Data and Evaluation Datasets
Evaluation and data coverage are extensive and multi-faceted:
- Evaluation and training references cite "40 diverse datasets."
Specific dataset counts or inventories referenced include:
- DIV2K-Val dataset (2,994 images)
- Flare7K++: 7,000 synthetic flares and 962 real-captured flare images
- FlareReal600: 600 aligned training images
- LOLv1: 485 training pairs, 15 testing pairs
- LOLv2-real: 689 training pairs, 100 testing pairs
- SICE: larger-scale dataset with diverse scenes
- FiveK dataset: 500 aligned LDR and HDR image pairs
- HDR+ benchmark: 250 test image pairs
- Lytro: 20 pairs of multi-focus images
- MFFW: 13 real image pairs with strong Defocus Spread Effect (DSE)
- MFI-WHU: 120 pairs constructed using Gaussian blur and decision maps
- SIMIF: 12 high-resolution pairs
Multi-focus fusion benchmarks in the evaluation suite include Lytro, MFFW, MFI-WHU, and SIMIF. A large set of established benchmarks is also used for comparative evaluation (examples: DIV2K, RealSR, Rain200L/H, GoPro, RealBlur, DPDD, RealDOF, McMaster, Kodak24, Urban100, SIDD Val, UIEB, and many more).
The approach is frequently assessed in a zero-shot setting — explicitly evaluated without domain-specific fine-tuning.
Evaluation Protocols and Metrics
Evaluations employ a mixture of full-reference and no-reference metrics. Common metrics include PSNR, SSIM, LPIPS, MS-SSIM, NIQE, NIMA, FADE, BRISQUE, UIQM, UCIQE, and a variety of task- or domain-specific statistics (EN, AG, SF, SD, SCD, VIF, ΔE, G-PSNR, S-PSNR, etc.). Human perceptual judgments are referenced qualitatively to contrast perceptual quality with pixel-wise fidelity.
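For reference, the two full-reference metrics that dominate the comparisons below can be computed with scikit-image's standard implementations. This is a minimal sketch assuming uint8 RGB inputs of identical shape (the `channel_axis` argument requires scikit-image 0.19+; LPIPS and the no-reference metrics need separate packages).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(restored: np.ndarray, reference: np.ndarray) -> dict:
    """PSNR and SSIM for uint8 RGB images of identical shape."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored, data_range=255,
                                 channel_axis=-1)  # last axis holds color
    return {"PSNR": psnr, "SSIM": ssim}
```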
Quantitative evaluation summaries highlight a recurring pattern:
- The generative approach produces visually convincing and semantically plausible restorations, yet scores lower on common reference-based metrics (PSNR/SSIM) compared to specialist, task-specific networks.
- Nano Banana Pro often underperforms specialist methods on PSNR/SSIM while performing strongly on no-reference perceptual metrics (e.g., UIQM, UCIQE) and subjective visual quality assessments.
Notable per-benchmark and per-model numbers (preserved from evaluation results):
- On the Real20 benchmark, Nano Banana Pro scored PSNR 20.26, SSIM 0.655. For comparison, WindowSeat (Qwen-IE) scored PSNR 26.6, SSIM 0.864 and WindowSeat PSNR 26.28, SSIM 0.856.
- Other Real20 entries include DSIT (data II) PSNR 25.22, SSIM 0.836 and RDNet (w nature) PSNR 25.58, SSIM 0.846, with WindowSeat and WindowSeat (Qwen-IE) among the top scorers.
- On the Nature split: DSIT (data II) PSNR 27.27, SSIM 0.932; WindowSeat (Qwen-IE) PSNR 27.57, SSIM 0.855; DAI PSNR 27.05, SSIM 0.846.
- On Objects: WindowSeat PSNR 28.81, SSIM 0.944; DSRNet (with extra) PSNR 26.74, SSIM 0.92.
- On Wild: WindowSeat (Qwen-IE) PSNR 29.44, SSIM 0.936.
Selected headline and per-dataset results:
- Rain200L: PSNR 26.05, SSIM 0.7954
- Rain200H: PSNR 21.1, SSIM 0.6659
- SPA-Data: PSNR 32.25, SSIM 0.9142
- DPDD (NB Pro): PSNR 20.180 dB, SSIM 0.635
- RealDOF (NB Pro): PSNR 20.821 dB, SSIM 0.641
- GoPro (NB Pro variant): PSNR 21.41 dB
- HIDE (NB Pro variant): PSNR 21.35 dB
- DIV2K-Val: NIQE 3.52 (Nano Banana Pro)
- Flare7K++ (Nano Banana Pro): PSNR 24.92, SSIM 0.844
- LOLv1: PSNR 18.496, SSIM 0.684, LPIPS 0.481
- LOLv2-real: PSNR 15.661, SSIM 0.537, LPIPS 0.465
- HDR+: PSNR 14.24, SSIM 0.467, LPIPS 0.221, ΔE 19.82
- MIT-FiveK: PSNR 19.2, SSIM 0.639, LPIPS 0.133, ΔE 11.14 (a ΔE computation sketch follows this list)
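The ΔE values above are CIE color differences. A minimal sketch, assuming the CIEDE2000 variant averaged per pixel (the evaluation's exact formulation is not stated here):

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def mean_delta_e(restored_rgb: np.ndarray, reference_rgb: np.ndarray) -> float:
    """Mean CIEDE2000 difference for two float RGB images in [0, 1]."""
    lab_restored = rgb2lab(restored_rgb)    # convert to CIELAB color space
    lab_reference = rgb2lab(reference_rgb)
    return float(np.mean(deltaE_ciede2000(lab_restored, lab_reference)))
```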
Perceptual and no-reference metrics show strong performance:
- On the RTTS dataset, NBPro: FADE 0.986, BRISQUE 27.21, NIMA 4.95.
- On Fattal's set, NBPro: FADE 0.683, BRISQUE 22.16, NIMA 5.44.
- NB Pro achieved top UIQM and UCIQE on LSUI and best UIQM/third UCIQE on U45 in referenced evaluations.
- LPIPS examples: PPDN 0.143 versus NB Pro variants 0.287 (2K) and 0.361 (4K); PSNR for NB Pro (2K) 22.15 and NB Pro (4K) 19.07; SSIM for PPDN 0.708, NB Pro (2K) 0.496, and NB Pro (4K) 0.424. A minimal LPIPS usage sketch follows this list.
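For completeness, here is a minimal usage sketch for the LPIPS metric cited above, using the reference `lpips` package (`pip install lpips`). The AlexNet backbone shown is the package's common default; the backbone used in the cited evaluations is not stated here.

```python
import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')           # learned perceptual metric, AlexNet backbone

img0 = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in restored image, scaled to [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1   # stand-in reference image
distance = loss_fn(img0, img1)              # lower = perceptually closer
print(float(distance))
```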
Comparative statements preserved from the evaluations note that Nano Banana Pro frequently trails specialist GAN, diffusion, CNN, and Transformer-based restoration models on pixel-wise benchmarks, while often ranking highly on no-reference perceptual metrics.
Strengths and Notable Behaviors
- Demonstrates superior subjective visual quality and perceptual appeal in many scenarios, especially where semantic plausibility and global structure recovery matter more than pixel fidelity.
- Excels in semantic comprehension, fine-grained visual modeling, and precise structural control in some contexts.
- Utilizes semantic priors to reconstruct plausible details in information-deficient regions and tends to synthesize sharp textures for static environments and text.
- Strong performance on no-reference perceptual metrics (UIQM, UCIQE, NIQE variants) and certain underwater evaluation metrics (EN, AG, SF, VIF); sketches of EN, AG, and SF follow this list.
- Adaptable zero-shot behavior using simple textual prompts, enabling application without additional task-specific training.
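Three of the no-reference statistics named above (EN, AG, SF) follow directly from standard definitions. The sketch below implements those textbook formulas in NumPy for a grayscale image; the cited evaluations' exact implementations may differ in details such as padding or normalization.

```python
import numpy as np

def entropy(img: np.ndarray) -> float:
    """EN: Shannon entropy of an 8-bit grayscale histogram."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                              # drop empty bins before log
    return float(-(p * np.log2(p)).sum())

def average_gradient(img: np.ndarray) -> float:
    """AG: mean magnitude of horizontal/vertical intensity gradients."""
    gx, gy = np.gradient(img.astype(np.float64))
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2)))

def spatial_frequency(img: np.ndarray) -> float:
    """SF: combined row/column frequency of the image."""
    f = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(f, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(f, axis=0) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))
```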
Limitations and Failure Modes
Despite perceptual strengths, the approach has numerous documented limitations and concrete failure cases:
- Systematic underperformance on pixel-wise reference metrics: consistently lower PSNR and SSIM compared to specialist models across many datasets.
- Tendency to hallucinate textures and generate high-frequency details that deviate from ground truth; generative bias can produce structural or semantic deviations.
- Sensitivity to prompt formulation, together with the stochasticity inherent to diffusion-based generation, can lead to inconsistent restoration behavior across runs (see the seed-pinning sketch after this list).
- Specific failure modes include unnatural color shifts, semantic alteration or identity swaps in faces, haloes and excessive sharpening, ghosting artifacts, inversion or hallucination of weather elements, erroneous text hallucination, and incomplete or erroneous reflection removal.
- Not suited for tasks demanding strict pixel-precise fidelity or rigorous factual accuracy; for example, NB Pro scores PSNR 20.26 and SSIM 0.655 on Real20, substantially lower than specialist methods on the same benchmark.
- Occasional spatial misalignment and unintended field-of-view expansion.
- Challenges in preserving spectral fidelity, handling subtle shadows, and modeling complex, realistic noise for denoising tasks.
- Inability to fully reverse defocus in multiple test cases; outputs sometimes show only modest contrast improvements without true deblurring.
- Frequent trade-off: improved perceptual metrics and subjective quality at the expense of reference-based fidelity.
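Regarding the run-to-run variability noted above, one routine mitigation is pinning every random-number source before sampling. The sketch assumes a PyTorch-based sampler; NB Pro's sampler internals are not public.

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Pin common RNG sources so repeated runs draw the same noise."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)   # seeds all devices in recent PyTorch versions

set_seed(42)  # call once before invoking the sampler
```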
Future Directions and Mitigations
Suggested or observed research directions and adaptations include:
- Development of perception-aligned evaluation methods tailored for generative low-level vision solvers to better reflect human judgments.
- Hybrid frameworks integrating generative backbones with task-specific physical constraints or regression modules to improve fidelity.
- Prompt engineering and targeted prompt tuning for enhanced pixel-level restoration accuracy.
- Lightweight fine-tuning strategies, adapter-based adaptation, or few-shot learning with example image pairs to reduce the performance gap with specialist models.
- Integration of post-processing modules or physical constraints to better preserve color fidelity and structural consistency (a minimal low-frequency anchoring sketch follows this list).
- Systematic exploration of controllable generation and constrained generative processes to mitigate hallucination risks.
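As one concrete shape the hybrid/post-processing items above could take, the sketch below anchors the generative output's low-frequency band to the observed input: global color and structure stay tied to the observation while synthesized high-frequency detail is kept. This is an illustrative adaptation of the idea, not a published NB Pro component.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def anchor_low_frequencies(generated: np.ndarray, observed: np.ndarray,
                           sigma: float = 8.0) -> np.ndarray:
    """Swap the generated image's low-pass band for the observation's.

    Both inputs are float RGB arrays of shape (H, W, 3) in [0, 1].
    """
    # Smooth spatially only (sigma 0 on the channel axis).
    low_gen = gaussian_filter(generated, sigma=(sigma, sigma, 0))
    low_obs = gaussian_filter(observed, sigma=(sigma, sigma, 0))
    high_gen = generated - low_gen        # keep synthesized fine detail
    return np.clip(low_obs + high_gen, 0.0, 1.0)
```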
Safety Considerations and Risks
Documented risks include:
- Semantic fidelity risks that can impair downstream computer vision applications (for example, detection or recognition systems) due to hallucinated or altered content.
- Potential for false positives or misdetections in downstream algorithms because of artifacts introduced during generative restoration.
- No specific mitigations or content filters are specified; safety evaluations and task-specific guardrails are recommended as part of deployment planning.
Summary of Comparative Observations
- Nano Banana Pro delivers strong perceptual and no-reference results, often producing visually pleasing, semantically plausible outputs in challenging real-world degradations.
- On traditional full-reference benchmarks and pixel-fidelity metrics (PSNR, SSIM), NB Pro frequently falls behind specialist restoration networks (GAN-, diffusion-, CNN-, and Transformer-based methods), sometimes by substantial margins (e.g., "lagged by over 4 dB in PSNR on DIV2K-Val").
- The method is most appropriate where perceptual quality and semantic coherence are prioritized over strict pixel-level fidelity; for applications requiring exact reconstruction, hybrid or task-specific solutions remain preferable.