Text To Image Survey
Abstract
This survey reviews the progress of diffusion models in generating images from text,
i.e. text-to-image diffusion models. As a self-contained work, this survey starts with
a brief introduction of how diffusion models work for image synthesis, followed by
the background for text-conditioned image synthesis. Based on that, we present an
organized review of pioneering methods and their improvements on text-to-image gen-
eration. We further summarize applications beyond image generation, such as text-
guided generation for various modalities like videos, and text-guided image editing.
Beyond the progress made so far, we discuss existing challenges and promising future
directions.
Keywords: Generative models, Diffusion models, Text-to-image generation
1. Introduction
A picture is worth a thousand words. Images often convey stories more effectively
than text alone. The ability to visualize from text enhances human understanding and
enjoyment. Therefore, creating a system that generates realistic images from text de-
scriptions, i.e., the text-to-image (T2I) task, is a significant step towards achieving
human-like or general artificial intelligence. With the development of deep learning,
the text-to-image task has become one of the most impressive applications in computer vision.
Example prompts (left to right): "an espresso machine that makes coffee from human souls, artstation"; "panda mad scientist mixing sparkling chemicals, artstation"; "a corgi's head depicted as an explosion of a nebula".
Figure 1: Generated images by text-to-image diffusion models. These images are examples generated by the
pioneering model DALL-E2 [1] from OpenAI. Based on user-input text prompts, the model can generate
very imaginative images with high fidelity.
[Figure 2 timeline: AlignDRAW (2015.11.09), Text-conditional GAN (2016.05.17), StackGAN (2016.12.10), AttnGAN (2017.11.28), ControlGAN (2019.09.16), DALL-E (2021.02.24), CogView (2021.05.26), NUWA (2021.11.24), VQ Diffusion (2021.11.29), GLIDE (2021.12.20), Stable Diffusion (SD) (2021.12.20), DALL-E 2 (2022.04.13), Imagen (2022.05.23), Parti (2022.06.22), SD XL (2023.07.26), DALL-E 3 (2023.08.20), SD 3 (2024.02.22), SD 3.5 (2024.10.22).]
Figure 2: Representative works on the text-to-image task over time. The GAN-based methods, autoregressive
methods, and diffusion-based methods are marked in yellow, blue, and red, respectively. We abbreviate
Stable Diffusion as SD for brevity in this figure. As diffusion-based models have achieved unprecedented
success in image generation, this work mainly discusses the pioneering studies for text-to-image generation
using diffusion models.
Early methods such as AlignDRAW [3] generated images from natural language, albeit with limited realism. Text-conditional GAN [4] emerged as the
first fully end-to-end differentiable architecture extending from character-level input to
pixel-level output, but was limited to small-scale training data. Autoregressive methods
further utilize large-scale training data for text-to-image generation, such as DALL-
E [5] from OpenAI. However, their autoregressive nature makes these methods [5, 6, 7, 8]
suffer from high computation costs and sequential error accumulation.
More recently, diffusion models (DMs) have emerged as the leading method in
text-to-image generation [9, 1]. Figure 1 shows example images generated by the pi-
oneering text-to-image diffusion model DALL-E2 [1], demonstrating extraordinary
fidelity and imagination. However, the vast amount of research in this field makes it
difficult for readers to learn the key breakthroughs without a comprehensive survey. A
branch of existing surveys [10, 11, 12] reviews the progress of the diffusion model in
all fields, offering a limited introduction specifically on text-to-image synthesis. Other
studies [13, 11, 14] focus on text-to-image tasks using GAN-based approaches, without
covering diffusion-based methods.
To our knowledge, this is the first survey to review the progress of diffusion-based
text-to-image generation. The rest of the paper is organized as follows. We also sum-
marize the paper outline in Figure 3. Section 2 introduces the background of diffusion
models. Section 3 covers pioneering studies on text-to-image diffusion models, while
Section 4 discusses the follow-up advancements. Section 5 discusses the evaluation of
text-to-image diffusion models from the technical and ethical perspectives. Section 6
explores tasks beyond text-to-image generation, such as video generation and 3D object generation. Finally, we discuss challenges and future opportunities in text-to-image generation tasks.

[Figure 3 outline:
Sec 2, Background on diffusion models: development before DDPM (Diffusion Probabilistic Models (DPM), Score-based Generative Models (SGM)); DDPM; guidance in image synthesis.
Sec 3, Pioneering text-to-image diffusion models: frameworks in pixel space (GLIDE, Imagen); frameworks in latent space (Stable Diffusion, DALL-E 2); recent progress.
Sec 5, Model evaluation: technical evaluation (evaluation metrics, evaluation benchmarks); ethical issues and risks (ethical risks from the datasets, misuse for malicious purposes, security and privacy risks).
Sec 6, Applications beyond image generation: text-to-X generation (text-to-art, text-to-video, text-to-3D); text-guided image editing (inversion for image editing, editing with mask control, expanded editing with flexible texts).
Sec 7: Challenges and outlook.]

Figure 3: Paper outline. We summarize each section in this figure. Our work not only offers a comprehensive overview of text-to-image diffusion models, but also provides readers a broader perspective by discussing related areas such as text-to-X generation.
2. Background on diffusion models

Diffusion models (DMs), also widely known as diffusion probabilistic models [15],
are a family of generative models formulated as Markov chains and trained with variational in-
ference [16]. The learning goal of a DM is to reverse a process that gradually perturbs the data
with noise, i.e., diffusion, for sample generation [15, 16]. As a milestone work, de-
noising diffusion probabilistic model (DDPM) [16] was published in 2020 and sparked
an exponentially increasing interest in the community of generative models afterwards.
Here, we provide a self-contained introduction to DDPM by covering the most related
progress before DDPM and how unconditional DDPM works with image synthesis as
a concrete example. Moreover, we summarize how guidance helps in conditional DM,
which is an important foundation for understanding text-conditional DM for text-to-
image.
The advent of DDPM [16] can be mainly attributed to two earlier lines of work: score-based
generative models (SGM) [17], investigated in 2019, and diffusion probabilistic models
(DPM) [15], which emerged as early as 2015. Therefore, it is important to
revisit the working mechanism of DPM and SGM before we introduce DDPM.
Diffusion Probabilistic Models (DPM). DPM [15] is the first work to model
a probability distribution by estimating the reversal of a Markov diffusion chain that
maps data to a simple distribution. Specifically, DPM defines a forward (inference)
process which converts a complex data distribution to a much simpler one, and then
learns the mapping by reversing this diffusion process. Experimental results on multi-
ple datasets show the effectiveness of DPM when estimating complex data distribution.
DPM can be viewed as the foundation of DDPM [16], while DDPM optimizes DPM
with improved implementations.
$x_T \rightarrow \cdots \rightarrow x_t \rightarrow x_{t-1} \rightarrow \cdots \rightarrow x_0$, with learned reverse transitions $p_\theta(x_{t-1} \mid x_t)$ and forward transitions $q(x_t \mid x_{t-1})$.
Figure 4: Diffusion process illustrated in [16]. Diffusion models include a forward pass that adds noise to
a clean image, and a reverse pass that recovers the clean image from its noisy counterpart.
Forward pass. Starting from a clean sample $x_0 \sim q(x_0)$, the forward pass gradually perturbs it with Gaussian noise over $T$ steps:

$$ q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \tag{3} $$

where $T$ and $\beta_t$ are the diffusion steps and hyper-parameters, respectively. We only
discuss the case of Gaussian noise as transition kernels for simplicity, indicated as $\mathcal{N}$
in Eq. 3. With $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=0}^{t} \alpha_s$, we can obtain a noised image at an arbitrary timestep $t$:

$$ q(x_t \mid x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, I\right) \tag{4} $$
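To make Eq. 4 concrete, the short PyTorch sketch below (our illustration rather than code from [16]; the linear β schedule values are the common DDPM defaults) samples a noised image x_t directly from a clean image x_0:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear beta schedule commonly used in DDPM
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t, the cumulative product in Eq. 4

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)                # a toy "clean image" with values in [0, 1]
xt = q_sample(x0, t=T // 2)                  # heavily noised sample halfway through the chain
```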
Reverse pass. With the forward pass defined above, we can train the transition
kernels with a reverse process. Starting from $p_\theta(x_T)$, we hope the generated $p_\theta(x_0)$ can
follow the true data distribution $q(x_0)$. Therefore, the model is optimized with a variational
bound on the negative log-likelihood, as derived in [18].
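While the exact equation from [18] is not reproduced here, a commonly used simplified form of this objective trains a noise-prediction network $\epsilon_\theta$ to regress the noise added in the forward pass:

$$ L_{\text{simple}}(\theta) = \mathbb{E}_{t,\ x_0,\ \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2 \right] $$

Minimizing this loss over random timesteps $t$ and noise draws trains the denoiser used in the reverse pass.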
Considering the similarities between the optimization objectives of DDPM and SGM, the two
are unified in [19] from the perspective of stochastic differential equations, allowing
more flexible sampling methods.
A follow-up work [20] removes the need for an external classifier, a technique known
as classifier-free guidance. Specifically, classifier-free guidance jointly trains a sin-
gle model with an unconditional score estimator ϵθ (x) and a conditional one ϵθ (x, c), where c
denotes the class label. A null token ∅ is placed as the class label in the uncondi-
tional part, i.e., ϵθ (x) = ϵθ (x, ∅). Experimental results in [20] show that classifier-free
guidance achieves a trade-off between quality and diversity similar to that achieved by
classifier guidance. Without resorting to a classifier, classifier-free diffusion facilitates
more modalities, e.g., text in text-to-image, as guidance.
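As a minimal sketch of how this combination is applied at sampling time (our illustration; the eps_model signature below is hypothetical rather than taken from [20]):

```python
import torch

def classifier_free_guidance(eps_model, x_t, t, cond_emb, null_emb, guidance_scale=7.5):
    """Combine conditional and unconditional noise predictions for one sampling step.

    eps_model: a noise-prediction network eps_theta(x_t, t, cond) (hypothetical signature).
    null_emb:  embedding of the null token used for the unconditional branch.
    """
    eps_cond = eps_model(x_t, t, cond_emb)      # epsilon_theta(x, c)
    eps_uncond = eps_model(x_t, t, null_emb)    # epsilon_theta(x, null) = epsilon_theta(x)
    # Extrapolate away from the unconditional prediction toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger guidance scales push samples toward the condition at the cost of diversity, which is the quality-diversity trade-off discussed above.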
Figure 5: Model architecture of Stable Diffusion [2]. Stable Diffusion first converts the image to a latent
space, where the diffusion process is performed. Stable Diffusion significantly improves the quality and
efficiency of image generation compared to prior models.
Imagen: encoding text with pretrained language model. Following GLIDE [9],
Imagen [21] adopts classifier-free guidance for image generation. A core difference
between GLIDE and Imagen lies in their choice of text encoder. Specifically, GLIDE
trains the text encoder together with the diffusion prior with paired image-text data,
while Imagen [21] adopts a pretrained and frozen large language model as the text en-
coder. Since the text-only corpus is significantly larger than paired image-text data,
such as 800GB used in T5 [22], the pretrained large language models are exposed
to text with a rich and wide distribution. With different T5 [22] variants as the text
encoder, [21] reveals that increasing the size of language model improves the im-
age fidelity and image-text alignment more than enlarging the diffusion model size in
Imagen. Moreover, freezing the weights of the pretrained encoder facilitates offline computation
of text embeddings, adding negligible computation burden to the online training of the
text-to-image diffusion prior.
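As a concrete illustration of this design, the sketch below (assuming the Hugging Face transformers library and the public "t5-small" checkpoint as a small stand-in for the larger T5 variants used in Imagen [21]) computes frozen text embeddings that can be cached offline:

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small").eval()
for p in encoder.parameters():
    p.requires_grad_(False)                      # keep the language model frozen

prompts = ["a corgi's head depicted as an explosion of a nebula"]
tokens = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state   # (batch, seq_len, hidden_dim)
# The embeddings can be cached to disk and reused while training the diffusion model.
```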
Figure 6: Model architecture of DALL-E 2 [1]. DALL-E 2 uses the CLIP [23] model to project images and
text into a shared latent space.
Rather than using VQ-VAE to learn a visual codebook, Stable Diffusion applies VQ-GAN for the latent
representation in the first stage. Notably, VQ-GAN improves VQ-VAE by adding an
adversarial objective to increase the naturalness of synthesized images. With the pre-
trained VAE, Stable Diffusion reverses a forward diffusion process that perturbs the latent
space with noise. Stable Diffusion also introduces cross-attention as general-purpose
conditioning for various condition signals like text. The model architecture of Stable
Diffusion is shown in Figure 5. Experimental results in [2] highlight that perform-
ing diffusion modeling on the latent space significantly outperforms that on the pixel
space in terms of complexity reduction and detail preservation. A similar approach has
also been investigated in VQ-diffusion with a mask-then-replace diffusion strategy. Re-
sembling the finding in pixel-space method, classifier-free guidance also significantly
improves the text-to-image diffusion models in latent space [2].
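For readers who want to try this, the sketch below (assuming the diffusers library and the public "runwayml/stable-diffusion-v1-5" checkpoint) shows how the released Stable Diffusion pipeline wires a VAE, a text encoder, and a denoising U-Net together, with all denoising steps happening in latent space:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# pipe.vae maps between pixel space and latent space; pipe.unet performs the denoising
# steps on the latent, conditioned on text embeddings from pipe.text_encoder via
# cross-attention; only the final latent is decoded back to pixels.
image = pipe("an espresso machine that makes coffee from human souls, artstation").images[0]
image.save("espresso.png")
```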
DALL-E2: with multimodal latent space. Another stream of text-to-image diffu-
sion models in latent space relies on multimodal contrastive models [23], where image
and text embeddings are matched in the same representation space. For exam-
ple, CLIP [23] is a pioneering work learning the multimodal representations and has
been widely used in numerous text-to-image models [1]. A representative work apply-
ing CLIP is DALL-E 2, also known as unCLIP [1], which adopts the CLIP text encoder
but inverts the CLIP image encoder with a diffusion model that generates images from
CLIP latent space. Such a combination of encoder and decoder resembles the structure
of VAE adopted in LDM, even though the inverting decoder is non-deterministic [1].
Therefore, the remaining task is to train a prior that bridges the gap between the CLIP text
and image latent spaces, which we term the text-image latent prior for brevity. DALL-
E2 [1] finds that this prior can be learned by either an autoregressive method or a diffusion
model, with the diffusion prior achieving superior performance. Moreover, experimental re-
sults show that removing this text-image latent prior leads to a performance drop by
a large margin [1], which highlights the importance of learning the text-image latent
prior. We show image examples generated by DALLE-2 in Figure 1.
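The shared latent space that unCLIP builds on can be probed directly; the sketch below (assuming the transformers library and the public "openai/clip-vit-base-patch32" checkpoint; the image file name is hypothetical) projects an image and a caption into CLIP space and measures their cosine similarity:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("espresso.png")             # hypothetical local image file
inputs = processor(text=["an espresso machine"], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
similarity = torch.cosine_similarity(image_emb, text_emb).item()
print(f"CLIP similarity: {similarity:.3f}")
```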
Recent progress of Stable Diffusion and DALL-E family. Since the publication
of Stable Diffusion [2], multiple versions of models have been released, including Sta-
ble Diffusion 1.4, 1.5, 2.0, 2.1, XL, and 3. Starting from Stable Diffusion 2.0 [24],
a notable feature is negative prompts, which allow users to specify what they do not
wish to generate in the output image. Stable Diffusion XL [25] enhances capabilities
beyond previous versions by incorporating a larger Unet architecture, leading to im-
proved abilities such as face generation, richer visuals, and more impressive aesthetics.
Stable Diffusion 3 is built on the diffusion transformer architecture [26] and uses two sep-
arate sets of weights to model the text and image modalities. Stable Diffusion 3 improves the
overall comprehension and typography of generated images. On the other hand, the
evolution of the DALL-E model has progressed from the autoregressive DALL-E [5],
to the diffusion-based DALL-E2 [1], and most recently, DALLE-3 [27]. Integrated
into the GPT-4 API, DALLE-3 showcases superior performance in capturing intricate
nuances and details.
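As a usage illustration of negative prompts (a minimal sketch assuming the diffusers library and the public "stabilityai/stable-diffusion-2-1" checkpoint):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a portrait of an astronaut in a sunflower field, highly detailed",
    negative_prompt="blurry, low quality, extra limbs, watermark",  # what NOT to generate
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
image.save("astronaut.png")
```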
4. Model advancements
On the choice of guidance. Beyond the classifier-free guidance, some works [9]
have also explored cross-modal guidance with CLIP [23]. Specifically, GLIDE [9]
finds that CLIP guidance underperforms the classifier-free variant of guidance. By
contrast, another work, UPainting [28], points out that the lack of a large-scale trans-
former language model makes it difficult for CLIP-guided models to encode
text prompts and generate complex scenes with details. By combining a large language
model and cross-modal matching models, UPainting [28] significantly improves the
sample fidelity and image-text alignment of generated images. The general image syn-
thesis capability enables UPainting [28] to generate images in both simple and complex
scenes.
Denoising process. By default, a DM during inference repeats the denoising process
with the same denoiser model, which makes sense for unconditional image synthesis
since the goal is only to obtain a high-fidelity image. In the task of text-to-image synthesis,
the generated image is also required to align with the text, which implies that the de-
noiser model has to make a trade-off between these two goals. Specifically, two recent
works [29, 30] point out a phenomenon: the early sampling stage strongly relies on the
text prompt for the goal of aligning with the caption, but the later stage focuses on im-
proving image quality while almost ignoring the text guidance. Therefore, they abort
the practice of sharing model parameters during the denoising process and propose to
adopt multiple denoiser models which are specialized for different generation stages.
Specifically, ERNIE-ViLG 2.0 [29] also mitigates the problem of object-attribute misalignment with
the guidance of a text parser and an object detector, improving fine-grained semantic
control.
Model architecture. A branch of studies enhances text-to-image generation by
improving the denoising model. For instance, Free-U [31] strategically re-weights the
contributions sourced from the U-Net’s skip connections and backbone feature maps,
which improves image generation quality without additional training or fine-tuning.
The pioneering work DiT [26] proposes a diffusion transformer architecture as the
denoising model of diffusion models, which replaces the commonly-used U-Net back-
bone (see Figure 7). Pixart-α [32] is a pioneering work that adopts a transformer-based
backbone and supports high-resolution image synthesis up to 1024 × 1024 resolution
with low training cost.
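To make the adaLN-Zero design used in DiT concrete, the simplified PyTorch sketch below (our illustration, not the official DiT code; layer sizes are illustrative) shows a DiT-style block in which the conditioning embedding regresses per-block shift, scale, and gate parameters, with the final projection initialized to zero so each block starts as the identity:

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Simplified DiT-style transformer block with adaLN-Zero conditioning."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN-Zero: regress shift/scale/gate from the conditioning (timestep + label)
        # and zero-initialize the projection so the residual branches start disabled.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) patch tokens; cond: (B, dim) conditioning embedding
        s1, sc1, g1, s2, sc2, g2 = self.adaLN(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x
```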
[Figure 7 diagram: the Latent Diffusion Transformer patchifies the noised latent (32 × 32 × 4) and conditions on the timestep t and label y; the DiT block is shown in three variants: adaLN-Zero, cross-attention, and in-context conditioning.]
Figure 7: Transformer architecture of DiT [26]. DiT trains conditional latent diffusion models with transformer blocks. Adaptive layer norm works best among all block types.
Model acceleration. Diffusion models have achieved great success in image gen-
eration, outperforming GAN. However, one drawback of diffusion models is their slow
sampling process, which requires hundreds or thousands of iterations to generate an
image. Progressive distillation [33], which also introduces the v-prediction parameterization, improves the sampling speed by distilling a pre-trained diffu-
sion model with an N-step DDIM sampler into a new model that uses N/2 sampling steps, without
hurting generation quality. Flow Matching (FM) [34] finds that employing FM with
diffusion paths results in a more robust and stable alternative for training diffusion
models. A recent work REPresentation Alignment (REPA) [35] emphasizes the key
role of representations in training large-scale diffusion models, and introduces the rep-
resentations from self-supervised models (DINO v2 [36]) to the training of diffusion
models like DiT [26] and SiT [37]. REPA [35] achieves significant acceleration results
by speeding up SiT [37] training by over 17.5×.
Figure 8: Concept control with DreamBooth [38]. Based on the user input images, Dream-
booth [38] finetunes a pretrained model to learn the key concept of the subject in the input images. Users can
further control the status of the subject with prompts such as "getting a haircut".
Figure 9: Spatial control in BoxDiff [43]. BoxDiff enables controlling the layout of generated images with
provided boxes or scribbles.
out the need for fine-tuning. Other representative studies for spatial control include
BoxDiff [43], which uses the provided boxes or scribbles to control the layout of generated
images, as shown in Figure 9.
Versatile content control. ControlNet [44] has attracted great attention due to
its powerful ability to add various conditioning controls to large pretrained models.
ControlNet [44] reuses the pretrained encoding layers as a strong backbone and proposes a
zero-convolution architecture to ensure that no harmful noise affects the
finetuning. ControlNet [44] achieves outstanding results with various conditioning
signals, such as edges, depth, and segmentation. Figure 10 shows an example from
[44] that uses Canny edges and human pose as conditions to control image generation with the
Stable Diffusion model. There are also other widely used methods that unify various
signals in one model for content control, such as T2I-Adapter [45], Uni-ControlNet [46],
GLIGEN [47], and Composer [48]. HumanSD [49] and HyperHuman [50] focus on
the generation of human images by taking human skeletons as model inputs.
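As a usage illustration (a minimal sketch assuming diffusers, opencv-python, and the public "lllyasviel/sd-controlnet-canny" and "runwayml/stable-diffusion-v1-5" checkpoints; the input file name is hypothetical), Canny edges extracted from a reference image can condition generation:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny",
                                             torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

source = np.array(Image.open("deer.png").convert("RGB"))   # hypothetical input image
edges = cv2.Canny(source, 100, 200)                        # extract the conditioning edge map
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

result = pipe("masterpiece of fairy tale, giant deer, golden antlers",
              image=edge_image, num_inference_steps=30).images[0]
result.save("controlled_deer.png")
```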
[Figure 10 panels: Input, Canny edge, Default, "masterpiece of fairy tale, giant deer, golden antlers", "…, quaint city Galic".]
Figure 10: Controlling Stable Diffusion with conditions [44]. ControlNet [44] allows users to specify conditions, such as Canny edges, in the image generation of large-scale pretrained diffusion models. For example, the default prompt is "a high-quality, detailed, and professional image", while users can add prompts such as "quaint city Galic".
Retrieval for out-of-distribution generation. State-of-the-art text-to-image models assume sufficient exposure to descriptions of common entities and styles from
training. This assumption breaks down with rare entities or vastly different styles,
leading to performance drops. To counter this, several studies [51, 52, 53, 54] use ex-
ternal databases for retrieval, a semi-parametric approach adapted from NLP [55, 56]
and GAN-based synthesis [57]. Retrieval-Augmented Diffusion Models (RDMs) [51]
use k-nearest neighbors (KNN) based on CLIP distance for enhanced diffusion guid-
ance, while KNN-diffusion [52] improves quality by adding text embeddings. Re-
Imagen [54] refines this with a single-stage framework, retrieving both images and text
in latent space, outperforming KNN-diffusion on the COCO benchmark.
5. Model evaluation
Evaluation metrics. Fréchet Inception Distance (FID) measures the distance between the feature
distributions of generated images and a set of real images used as reference. The smaller the FID, the higher the image fidelity. To measure text-
image alignment, the CLIP score is widely applied, which trades off against FID. There
are also other metrics for text-to-image evaluation, including the Inception Score (IS) [58]
for image quality and R-precision for text-image alignment.
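As an illustration of how these metrics are computed in practice (a minimal sketch assuming the torchmetrics library with its image and multimodal dependencies, e.g., torch-fidelity and transformers; the tiny random batch below is a toy stand-in, since real evaluations use tens of thousands of images, e.g., on MS-COCO):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

real = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)  # placeholder real images
fake = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)  # placeholder generated images

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())          # lower is better

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
prompts = ["a corgi's head depicted as an explosion of a nebula"] * 8
print("CLIP score:", clip_score(fake, prompts).item())  # higher is better
```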
Table 1: Image quality comparison of autoregressive and diffusion models. Diffusion models outperform autoregressive models in image quality, with lower FID on the MS-COCO dataset.

Model — FID (↓)
Autoregressive: CogView [6] — 27.10; LAFITE [59] — 26.94; DALL-E [5] — 17.89
Diffusion models: GLIDE [9] — 12.24; Imagen [21] — 7.27; Stable Diffusion [2] — 12.63; DALL-E 2 [1] — 10.39; UPainting [28] — 8.34; ERNIE-ViLG 2.0 [29] — 6.75; eDiff-I [30] — 6.95

Evaluation benchmarks. Apart from the automatic metrics discussed above, multiple works involve human evaluation and propose new evaluation benchmarks [60, 21, 8, 28, 61, 54, 62]. We summarize representative benchmarks in Table 2. For a better evaluation of fidelity and text-image alignment, DrawBench [21], PartiPrompts [8] and UniBench [28] ask human raters to compare generated images from different models. Specifically,
UniBench [28] proposes to evaluate the model on both simple and complex scenes
and includes both Chinese and English prompts. PartiPrompts [8] introduces a di-
verse set of over 1600 (English) prompts and also proposes a challenge dimension
that highlights why each prompt is difficult. To evaluate the model from various as-
pects, PaintSKills [60] evaluates the visual reasoning skills and social biases apart
from image quality and text-image alignment. However, PaintSKills [60] only focuses
on unseen object-color and object-shape scenarios [28]. EntityDrawBench [54] further
evaluates the model with various infrequent entities in different scenes. Compared to
PartiPrompts [8], which provides prompts at different difficulty levels, Multi-Task Benchmark [61]
proposes thirty-two tasks that evaluate different capabilities and divides each task into
three difficulty levels.
Table 2: Benchmarks for text-to-image generation task.
Benchmark Measurement Metric Auto-eval Human-eval Language
DrawBench [21] Fidelity, alignment User preference rates N Y English
UniBench [28] Fidelity, alignment User preference rates N Y English, Chinese
PartiPrompts [8] Fidelity, alignment Qualitative N Y English
PaintSKills [60] Visual reasoning skills, social biases Statistics Y Y English
EntityDrawBench [54] Entity-centric faithfulness Human rating N Y English
Multi-Task Benchmark [61] Various capabilities Human rating N Y English
Ethical risks from the datasets. Text-to-image models trained on large-scale web data may reinforce the biases from the dataset, leading to ethical risks. [63] finds a large amount of
inappropriate content (e.g., offensive, insulting, or threatening material) in images generated
by Stable Diffusion [2], and is the first to establish a test bed to evaluate
them. Moreover, it proposes Safe Latent Diffusion, which successfully removes and
suppresses inappropriate content with additional guidance. Another ethical issue, the
fairness of social group, is studied in [64, 65]. Specifically, [64] finds that simple ho-
moglyph replacements in the text descriptions can induce culture bias in models, i.e.,
generating images from different cultures. [65] introduces the Ethical NaTural Language
Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset, which
evaluates how generated images change under ethical interventions along three axes: gen-
der, skin color, and culture. With intervened text prompts, [65] improves diffusion
models (e.g., Stable Diffusion [2]) from the social diversity perspective. Fair Diffu-
sion [66] evaluates the fairness problem of diffusion models and mitigates this problem
at the deployment stage of diffusion models. Specifically, Fair Diffusion [66] instructs
the diffusion models on fairness with textual guidance. Another work [67] finds that a
broad range of prompts to text-to-image diffusion models could produce stereotypes,
such as simply mentioning traits, descriptors, occupations, or objects.
Misuse for malicious purposes. Text-to-image diffusion models have shown their
power in generating high-quality images. However, this also raises great concern that
the generated images may be used for malicious purposes, e.g., falsifying electronic
evidence [68]. DE-FAKE [68] is the first to conduct a systematic study on visual
forgeries by text-to-image diffusion models, aiming to distinguish generated
images from real ones and to further track the source model of each fake image.
To achieve these two goals, DE-FAKE [68] analyzes images from the visual modality perspective,
and finds that images generated by different diffusion models share common features
and also present unique model-wise fingerprints. Two concurrent works [69, 70] ap-
proach the detection of fake images by evaluating existing detection methods
on diffusion-generated images and by analyzing the frequency discrep-
ancy between images generated by GANs and diffusion models. They find that the performance of
existing detection methods drops significantly on diffusion-generated images
compared to GAN-generated ones. Moreover, [69] attributes the failure of existing methods to
the mismatch in high frequencies between images generated by diffusion models and
GANs. Another work [71] discusses concerns about artistic image generation from the
perspective of artists. While agreeing that artistic image generation may be a
promising modality for the development of art, [71] points out that it
may cause plagiarism and profit-shifting problems (profits in the art market shift from
artists to model owners) if not properly used.
Security and privacy risks. While text-to-image diffusion models have attracted
great attention, the security and privacy risks have been neglected so far. Two pio-
neering works [72, 73] discuss the backdoor attack and privacy issues, respectively.
Inspired by the findings in [64] that a simple word replacement can induce culture
bias in models, Rickrolling the Artist [72] proposes to inject backdoors into the
pre-trained text encoders, which force the generated image to follow a specific
description or include certain attributes whenever the trigger appears in the text prompt. [73] is
the first to analyze the membership leakage problem in text-to-image generation mod-
els, i.e., inferring whether a certain image was used to train the target text-to-image model.
Specifically, [73] proposes three intuitions on the membership information
and four attack methods accordingly. Experiments show that all the proposed attack
methods achieve impressive results, highlighting the threat of membership leakage.
6. Applications beyond image generation

The advancement of diffusion models has inspired various applications beyond im-
age generation, such as text-to-X generation, where X refers to a modality such as video, and
text-guided image editing. We introduce pioneering works as follows.
Figure 11: Text-to-video generation by Stable Video Diffusion [74] from Stability AI.
6.1. Text-to-X generation

6.1.1. Text-to-art
Artistic painting is an interesting and imaginative area that benefits from the suc-
cess of generative models. Despite the progress of GAN-based painting [75], such methods suf-
fer from the unstable training and mode collapse problems of GANs. Recently,
multiple works have presented impressive images of paintings based on diffusion mod-
els, investigating improved prompts and different scenes. Multimodal guided artwork
diffusion (MGAD) [76] refines the generative process of diffusion model with mul-
timodal guidance (text and image) and achieves excellent results regarding both the
diversity and quality of generated digital artworks. In order to maintain the global con-
tent of the input image, DiffStyler [77] proposes a controllable dual diffusion model
with learnable noise in the diffusion process of the content image. During inference,
explicit content and abstract aesthetics can both be learned with two diffusion models.
Experimental results show that DiffStyler [77] achieves excellent results on both quan-
titative metrics and manual evaluation. To improve the creativity of the Stable Diffusion
model, [78] proposes two directions, textual condition extension and model retrain-
ing with the Wikiart dataset, enabling users to prompt for novel images in the styles of famous
artists. [79] personalizes text-to-image generation by customizing aesthetic styles
with a set of images, while [80] extends generated images to Scalable Vector Graph-
ics (SVGs) for digital icons or arts. In order to improve computation efficiency, [53]
proposes to generate artistic images based on retrieval-augmented diffusion models.
By retrieving neighbors from specialized datasets (e.g., Wikiart), [53] obtains fine-
grained control of the image style. In order to specify more fine-grained style features
(e.g., color distribution and brush strokes), [81] proposes supervised style guidance
and self-style guidance methods, which can generate images in more diverse styles.
6.1.2. Text-to-video
Early studies. Since video is just a sequence of images, a natural application of
text-to-image is to make a video conditioned on the text input. Conceptually, text-to-
video DM lies at the intersection between text-to-image DM and video DM. Make-
A-Video [82] adapts a pretrained text-to-image DM to text-to-video, and Imagen
Video [83] extends an existing video DM method to text-to-video. Other representa-
tive text-to-video diffusion models include ModelScope [84], Tune-A-Video [85], and
VideoCrafter [86, 87, 88]. The success of text-to-video naturally inspires a future di-
rection of movie generation based on text inputs. Different from general text-to-video
tasks, story visualization requires the model to reason at each frame about whether to
maintain the consistency of actors and backgrounds between frames or scenes, based
on the story progress [89]. Make-A-Story [89] uses an autoregressive diffusion-based
framework and visual memory module to maintain consistency of actors and back-
grounds across frames, while AR-LDM [90] leverages image-caption history for co-
herent frame generation. Moreover, AR-LDM [90] shows the consistency for unseen
characters, and also the ability for real-world story synthesis on a newly introduced
dataset VIST [91].
Figure 12: Sora [92] from OpenAI. Sora represents video frames by compressing them into patches
with a transformer-based backbone.
Recent progress. More recently, Stable Video Diffusion [74] from Stability AI achieves
significant performance improvements for text-to-video and image-to-video genera-
tion. It identifies and evaluates different training stages of applying diffusion models to
video synthesis, and introduces a systematic training process including the captioning
and filtering strategies. OpenAI launches the state-of-the-art video generation model
Sora [92], which is capable of generating a minute of high-fidelity video. Inspired by
large language models that turn different data (like text and code) into tokens, Sora [92]
first unifies diverse types of videos and images as patches and compresses them to a
lower-dimensional latent space, as shown in Figure 12. Sora then decomposes the rep-
resentations into spacetime patches and performs the diffusion process based on the
transformer backbone. As Sora [92] is not open-sourced yet, some studies aim to pro-
vide open access to advanced video generation models, such as Open-Sora [93] and
Open-Sora-Plan [94].
6.1.3. Text-to-3D
3D object generation is evidently much more sophisticated than the 2D image synthesis
task. DreamFusion [95] is the first work that successfully applies text-to-image diffusion
models to 3D object synthesis. Inspired by Dream Fields [96], which
applies 2D image-text models (i.e., CLIP) to 3D synthesis, DreamFusion [95] trains a
randomly initialized NeRF [97] by distilling a pretrained 2D diffusion model
(i.e., Imagen). However, according to Magic3D [98], the low-resolution image su-
pervision and extremely slow optimization of NeRF result in low-quality generation
and long processing times for DreamFusion [95]. For higher-resolution results, Magic3D
[98] proposes a coarse-to-fine optimization approach that first obtains a coarse representation as
initialization and then optimizes mesh representations with high-resolution
diffusion priors. Magic3D [98] also accelerates the generation process with a sparse
3D hash grid structure. 3DDesigner [99] focuses on another aspect of 3D object genera-
tion, consistency, which refers to cross-view correspondence. With low-resolution
results from a NeRF-based condition module as the prior, a two-stream asynchronous
diffusion module further enhances the consistency and achieves 360-degree consistent
results. Apart from 3D object generation from text, the recent work Zero-1-to-3 [100]
has attracted great attention by enabling zero-shot novel view synthesis and 3D recon-
struction from a single image, inspiring various follow-up studies such as Vivid-1-to-
3 [101].
6.2. Text-guided image editing

Diffusion models not only significantly improve the quality of text-to-image syn-
thesis, but also enhance text-guided image editing. Before DMs gained popularity, zero-
shot image editing had been dominated by GAN inversion methods [102, 103, 104, 105,
106, 107] combined with CLIP. However, GANs often have limited in-
version capability, causing unintended changes to the image content. In this section,
we discuss pioneering studies for image editing based on diffusion models.
Inversion for image editing. A branch of studies edits the images by modifying
the noisy signals in the diffusion process. SDEdit [108] is a pioneering work that ed-
its images by iteratively denoising through a stochastic differential equation (SDE).
Without any task-specific training, SDEdit [108] first adds noise to the input (such as a
stroke painting), then denoises the noisy image through the SDE prior
to increase image realism. DiffusionCLIP [109] further adds text control to the edit-
ing process by fine-tuning the diffusion model along the reverse DDIM [110] process with a
CLIP-based loss. Due to its local linearization assumptions, DDIM may lead to incor-
rect image reconstruction with error propagation [111]. To mitigate this problem,
Exact Diffusion Inversion via Coupled Transformations (EDICT) [111] proposes to
maintain two coupled noise vectors in the diffusion process and achieves higher recon-
struction quality than DDIM [110]. Another work [112] introduces an accurate inver-
sion technique for text-guided editing by pivotal inversion and null-text optimization,
showing high-fidelity editing of various real images. To improve the editing efficiency,
LEDITS++ [113] proposes a novel inversion approach without tuning and optimization, which can produce high-fidelity results within a few diffusion steps.
Figure 13: Editing images with mask control in Blended Diffusion [114].
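The SDEdit-style editing discussed above is exposed in common toolkits; a minimal sketch (assuming the diffusers library and the public "runwayml/stable-diffusion-v1-5" checkpoint; the file names are hypothetical, and the strength argument controls how much noise is added before denoising):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

sketch = Image.open("stroke_painting.png").convert("RGB")   # hypothetical rough input
edited = pipe(prompt="a fantasy landscape, oil painting, highly detailed",
              image=sketch, strength=0.6, guidance_scale=7.5).images[0]
edited.save("edited.png")
```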
Editing with mask control. A branch of work manipulates the image mainly on a
local (masked) region [114], as shown in Figure 13. The difficulty lies in guaranteeing
seamless coherence between the masked region and the background. To this end, Blended
Diffusion [114] spatially blends the noisy image with the local text-guided diffusion latent
in a progressive manner. This approach is further improved with a blended latent
diffusion model in [115] and a multi-stage variant in [116]. Different from [114, 115,
116], which require a manually designed mask, DiffEdit [117] proposes to automatically
generate the mask indicating which part should be edited.
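Mask-controlled editing of this kind is available in off-the-shelf inpainting pipelines; a minimal sketch (assuming the diffusers library and the public "runwayml/stable-diffusion-inpainting" checkpoint; the file names are hypothetical, and white pixels in the mask mark the region to regenerate):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("room.png").convert("RGB")       # hypothetical input image
mask = Image.open("room_mask.png").convert("RGB")   # hypothetical mask of the edit region

result = pipe(prompt="a red leather armchair", image=image, mask_image=mask).images[0]
result.save("room_edited.png")
```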
Expanded editing with flexible texts. Some studies enable more types of image
editing with flexible text inputs. Imagic [118] is the first to perform text-based se-
mantic edits to a single image, such as postures or composition of multiple objects.
Specifically, Imagic [118] first obtains an optimized embedding for the target text, then
linearly interpolates between the target text embedding and the optimized one. This
interpolated representation is then fed to the fine-tuned model to generate the edited
images. To solve the problem that a simple modification of the text prompt may lead to
a completely different output, Prompt-to-Prompt [119] proposes to use cross-attention maps dur-
ing the diffusion process, which represent the relation between each image pixel and each
word in the text prompt. InstructPix2Pix [120] works on the task of editing an image following human-written instructions, as shown in Figure 14.
“Swap sunflowers with roses” “Add fireworks to the sky” “Replace the fruits with cake”
“What would it look like if it were snowing?” “Turn it into a still from a western” “Make his jacket out of leather”
Figure 14: Image editing by InstructPix2Pix [120]. InstructPix2Pix [120] allows users to edit an existing
image by simply giving textual instructions, such as "Swap sunflowers with roses".
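A minimal usage sketch of instruction-based editing (assuming the diffusers library and the public "timbrooks/instruct-pix2pix" checkpoint released alongside InstructPix2Pix [120]; the file names are hypothetical):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("garden.png").convert("RGB")     # hypothetical input image
edited = pipe(prompt="Swap sunflowers with roses", image=image,
              image_guidance_scale=1.5).images[0]
edited.save("garden_roses.png")
```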
7. Challenges and outlook
7.1. Challenges
7.2. Outlook
Safe and fair applications. With the wide application of text-to-image models,
how to mitigate the ethical issues and security risks of current text-to-image models is
demanding and challenging. Possible directions include more diverse and balanced
datasets to mitigate biases related to race and gender, advanced methods for detecting gener-
ated images, and robust diffusion models against various attacks.
Unified multi-modality framework. Text-to-image generation can be seen as part
of multi-modality learning. Most works focus on the single task of text-to-image
generation, but unifying multiple tasks into a single model can be a promising trend.
For example, UniD3 [127] unifies text-to-image generation and image captioning with
a single diffusion model. A unified multi-modality model can boost each task by
learning better representations from each modality, and may bring more insight into
how models understand multi-modality data.
Collaboration with other fields. In the past few years, deep learning has made
great progress in multiple areas, such as the multi-modal GPT-4 [128]. Prior studies have
investigated how diffusion models can collaborate with models from other fields, such
as the recent work that deconstructs a diffusion model into an autoencoder [129] and
the adoption of GPT-3 [130] in InstructPix2Pix [120]. There are also studies applying
diffusion models in vision applications, such as image restoration [131], depth estima-
tion [132, 133], image enhancement [134] and classification [135]. Further collabo-
rations between text-to-image diffusion model and recent findings in active research
fields are an exciting topic to be explored.
References
[3] E. Mansimov, E. Parisotto et al., Generating images from captions with atten-
tion, ICLR (2016).
[6] M. Ding, Z. Yang et al., Cogview: Mastering text-to-image generation via trans-
formers, NeurIPS 34 (2021) 19822–19835.
[7] C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, N. Duan, Nüwa: Visual
synthesis pre-training for neural visual world creation, in: European Conference
on Computer Vision, Springer, 2022, pp. 720–736.
[10] F.-A. Croitoru, V. Hondru et al., Diffusion models in vision: A survey, IEEE
TPAMI (2023).
[17] Y. Song, S. Ermon, Generative modeling by estimating gradients of the data
distribution, in: NeurIPS, volume 32, 2019.
[26] W. Peebles, S. Xie, Scalable diffusion models with transformers, in: Proceed-
ings of the IEEE/CVF International Conference on Computer Vision, 2023, pp.
4195–4205.
[28] W. Li, X. Xu et al., Upainting: Unified text-to-image diffusion generation with
cross-modal guidance, arXiv:2210.16031 (2022).
[30] Y. Balaji, S. Nah et al., ediffi: Text-to-image diffusion models with an ensemble
of expert denoisers, arXiv:2211.01324 (2022).
[31] C. Si, Z. Huang, Y. Jiang, Z. Liu, Freeu: Free lunch in diffusion u-net, in:
CVPR, 2024, pp. 4733–4743.
[32] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu,
et al., Pixart-α: Fast training of diffusion transformer for photorealistic text-to-
image synthesis, arXiv preprint arXiv:2310.00426 (2023).
[33] T. Salimans, J. Ho, Progressive distillation for fast sampling of diffusion models,
in: ICLR, 2021.
[38] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, Dreambooth:
Fine tuning text-to-image diffusion models for subject-driven generation, in:
Proceedings of the IEEE conference on computer vision and pattern recognition,
2023, pp. 22500–22510.
[39] R. Gal, Y. Alaluf et al., An image is worth one word: Personalizing text-to-
image generation using textual inversion, arXiv:2208.01618 (2022).
[46] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan, K.-Y. K. Wong, Uni-
controlnet: All-in-one control to text-to-image diffusion models, Advances in
Neural Information Processing Systems 36 (2024).
[47] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, Y. J. Lee, Gligen: Open-set
grounded text-to-image generation, in: CVPR, 2023, pp. 22511–22521.
[57] B. Li, P. H. Torr, T. Lukasiewicz, Memory-driven text-to-image generation,
arXiv:2208.07022 (2022).
[60] J. Cho, A. Zala, M. Bansal, Dall-eval: Probing the reasoning skills and social
biases of text-to-image generative transformers, arXiv:2202.04053 (2022).
[62] P. Liao, X. Li, X. Liu, K. Keutzer, The artbench dataset: Benchmarking genera-
tive models with artworks, arXiv:2206.11404 (2022).
[65] H. Bansal, D. Yin et al., How well can text-to-image generative models under-
stand ethical natural language interventions?, arXiv:2210.15230 (2022).
generation amplifies demographic stereotypes at large scale, in: Proceedings
of the 2023 ACM Conference on Fairness, Accountability, and Transparency,
2023, pp. 1493–1504.
[68] Z. Sha, Z. Li, N. Yu, Y. Zhang, De-fake: Detection and attribution of fake
images generated by text-to-image generation models, in: ACM SIGSAC CCS,
2023, pp. 3418–3432.
[73] Y. Wu, N. Yu, Z. Li, M. Backes, Y. Zhang, Membership inference attacks against
text-to-image generation models, arXiv:2210.00968 (2022).
[74] A. Blattmann, T. Dockhorn et al., Stable video diffusion: Scaling latent video
diffusion models to large datasets, arXiv preprint arXiv:2311.15127 (2023).
[76] N. Huang, F. Tang, W. Dong, C. Xu, Draw your art dream: Diverse digital art
synthesis with multimodal guided diffusion, in: ACM Multimedia, 2022, pp.
1085–1094.
[77] N. Huang, Y. Zhang, F. Tang, C. Ma, H. Huang, W. Dong, C. Xu, Diffstyler:
Controllable dual diffusion for text-driven image stylization, IEEE Transactions
on Neural Networks and Learning Systems (2024).
[81] Z. Pan, X. Zhou, H. Tian, Arbitrary style guidance for enhanced diffusion-based
text-to-image generation, in: Proceedings of the IEEE/CVF Winter Conference
on Applications of Computer Vision, 2023, pp. 4461–4471.
[83] J. Ho, W. Chan et al., Imagen video: High definition video generation with
diffusion models, arXiv:2210.02303 (2022).
[85] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie,
M. Z. Shou, Tune-a-video: One-shot tuning of image diffusion models for text-
to-video generation, in: Proceedings of the IEEE/CVF International Conference
on Computer Vision, 2023, pp. 7623–7633.
[86] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen,
X. Wang, C. Weng, Y. Shan, Videocrafter1: Open diffusion models for high-
quality video generation, 2023. arXiv:2310.19512.
[87] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, Y. Shan, Videocrafter2:
Overcoming data limitations for high-quality video diffusion models, 2024.
arXiv:2401.09047.
[88] J. Xing, M. Xia, Y. Zhang, H. Chen, X. Wang, T.-T. Wong, Y. Shan, Dynam-
icrafter: Animating open-domain images with video diffusion priors (2023).
arXiv:2310.12190.
[90] X. Pan, P. Qin, Y. Li, H. Xue, W. Chen, Synthesizing coherent story with auto-
regressive latent diffusion models, arXiv:2211.10950 (2022).
[93] Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y. Zhou, T. Li, Y. You, Open-
sora: Democratizing efficient video production for all, 2024. URL: https://
github.com/hpcaitech/Open-Sora.
[96] A. Jain, B. Mildenhall et al., Zero-shot text-guided object generation with dream
fields, in: Proceedings of the IEEE conference on computer vision and pattern
recognition, 2022, pp. 867–876.
[97] B. Mildenhall, P. P. Srinivasan et al., Nerf: Representing scenes as neural radiance
fields for view synthesis, Commun. ACM 65 (2021) 99–106.
[98] C.-H. Lin, J. Gao et al., Magic3d: High-resolution text-to-3d content creation,
in: Proceedings of the IEEE conference on computer vision and pattern recog-
nition, 2023, pp. 300–309.
[101] J.-g. Kwak, E. Dong, Y. Jin, H. Ko, S. Mahajan, K. M. Yi, Vivid-1-to-3: Novel
view synthesis with video diffusion models, in: CVPR, 2024, pp. 6775–6785.
[103] D. Bau, H. Strobelt, W. Peebles, J. Wulff, B. Zhou, J.-Y. Zhu, A. Torralba, Se-
mantic photo manipulation with a generative image prior, arXiv:2005.07727
(2020).
[107] V. V. Dere, A. Shinde, P. Vast, Conditional reiterative high-fidelity gan inversion
for image editing, Pattern Recognition 147 (2024) 110068.
[108] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, S. Ermon, SDEdit: Guided
image synthesis and editing with stochastic differential equations, in: Interna-
tional Conference on Learning Representations, 2022.
[111] B. Wallace, A. Gokul, N. Naik, Edict: Exact diffusion inversion via coupled
transformations, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2023, pp. 22532–22541.
[118] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, M. Irani,
Imagic: Text-based real image editing with diffusion models, arXiv:2210.09276
(2022).
[119] A. Hertz, R. Mokady et al., Prompt-to-prompt image editing with cross attention
control, arXiv:2208.01626 (2022).
[122] J. H. Liew, H. Yan, D. Zhou, J. Feng, Magicmix: Semantic mixing with diffusion
models, arXiv:2210.16056 (2022).
[127] M. Hu, C. Zheng et al., Unified discrete diffusion for simultaneous vision-
language generation, arXiv:2211.14842 (2022).
[129] X. Chen, Z. Liu, S. Xie, K. He, Deconstructing denoising diffusion models for
self-supervised learning, arXiv:2401.14404 (2024).
[131] Y. Liu, J. He, Y. Liu, X. Lin, F. Yu, J. Hu, Y. Qiao, C. Dong, Adaptbir: Adaptive
blind image restoration with latent diffusion prior for higher fidelity, Pattern
Recognition (2024) 110659.
[132] G. Kim, W. Jang, G. Lee, S. Hong, J. Seo, S. Kim, Depth-aware guidance with
self-estimated depth representations of diffusion models, Pattern Recognition
153 (2024) 110474.
[133] Y. Xu, S. Wu, B. Wang, M. Yang, Z. Wu, Y. Yao, Z. Wei, Two-stage fine-grained
image classification model based on multi-granularity feature fusion, Pattern
Recognition 146 (2024) 110042.