Imagic: Text-Based Real Image Editing With Diffusion Models
arXiv:2210.09276v2 [cs.CV] 22 Nov 2022
[Figure 2 panels: one input image per row, followed by edited images for the target texts listed below.]
Row 1: “A sitting dog”, “A jumping dog”, “A dog lying down”, “A dog playing with a toy”, “A jumping dog holding a frisbee”
Row 2: “A cat wearing a hat”, “A cat wearing an apron”, “A cat wearing a necklace”, “A cat wearing a jean jacket”, “A drawing of a cat”
Row 3: “A pistachio cake”, “A chocolate cake”, “A strawberry cake”, “A wedding cake”, “A slice of cake”
Figure 2. Different target texts applied to the same image. Imagic edits the same image differently depending on the input text.
sive when the desired edit is described by a simple natural language text prompt, since this aligns well with human communication. Many methods were developed for text-based image editing, showing promising results and continually improving [7, 9, 30]. However, the current leading methods suffer from, to varying degrees, several drawbacks: (i) they are limited to a specific set of edits such as painting over the image, adding an object, or transferring style [6, 30]; (ii) they can operate only on images from a specific domain or synthetically generated images [17, 40]; or (iii) they require auxiliary inputs in addition to the input image, such as image masks indicating the desired edit location, multiple images of the same subject, or a text describing the original image [6, 14, 36, 44, 48].

In this paper, we propose a semantic image editing method that mitigates all the above problems. Given only an input image to be edited and a single text prompt describing the target edit, our method can perform sophisticated non-rigid edits on real high-resolution images. The resulting image outputs align well with the target text, while preserving the overall background, structure, and composition of the original image. For example, we can make two parrots kiss
[Figure 3 diagram: panels (A) Text Embedding Optimization and (B) Model Fine-Tuning, each showing the noised input passed through the pre-trained diffusion model under a reconstruction loss.]
Figure 3. Schematic description of Imagic. Given a real image and a target text prompt: (A) We encode the target text and get the initial text embedding $e_{tgt}$, then optimize it to reconstruct the input image, obtaining $e_{opt}$; (B) We then fine-tune the generative model to improve fidelity to the input image while fixing $e_{opt}$; (C) Finally, we interpolate $e_{opt}$ with $e_{tgt}$ to generate the final editing result.
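As a rough illustration of the order of operations in the caption above, the following toy sketch replaces the diffusion model with a hypothetical additive "generator" that maps an embedding e to e + b, where b plays the role of the model weights. The generator, the function names, the step counts, and the learning rate are all assumptions made for illustration; none of them come from the paper. The sketch only mirrors the control flow: optimize the embedding for a few steps with the model frozen, fine-tune the model with the embedding frozen, then interpolate.

```python
# Toy sketch of the three-stage control flow. The "generator" e -> e + b is a
# hypothetical stand-in for a text-conditioned diffusion model, not the real thing.

def generate(e, b):
    """Stand-in generator: embedding e plus model weights b gives an 'image'."""
    return [ei + bi for ei, bi in zip(e, b)]

def recon_loss(x, e, b):
    """Squared reconstruction error between the input x and the generated image."""
    return sum((xi - gi) ** 2 for xi, gi in zip(x, generate(e, b)))

def optimize_embedding(x, e_tgt, b, steps=3, lr=0.1):
    """Stage A: a few gradient steps on the embedding, weights b frozen."""
    e = list(e_tgt)
    for _ in range(steps):
        # Gradient of (x_i - e_i - b_i)^2 w.r.t. e_i is -2 * (x_i - e_i - b_i).
        e = [ei + lr * 2.0 * (xi - ei - bi) for xi, ei, bi in zip(x, e, b)]
    return e

def fine_tune(x, e_opt, b, steps=200, lr=0.1):
    """Stage B: gradient steps on the weights b, embedding e_opt frozen."""
    b = list(b)
    for _ in range(steps):
        b = [bi + lr * 2.0 * (xi - ei - bi) for xi, ei, bi in zip(x, e_opt, b)]
    return b

def interpolate(e_tgt, e_opt, eta):
    """Stage C: e_bar = eta * e_tgt + (1 - eta) * e_opt."""
    return [eta * t + (1.0 - eta) * o for t, o in zip(e_tgt, e_opt)]

x = [1.0, 2.0, 3.0]        # toy "input image"
e_tgt = [0.0, 0.0, 4.0]    # toy "target text embedding"
b0 = [0.0, 0.0, 0.0]       # toy "pre-trained weights"

e_opt = optimize_embedding(x, e_tgt, b0)   # stays close to e_tgt (few steps)
b_ft = fine_tune(x, e_opt, b0)             # now generate(e_opt, b_ft) matches x
edited = generate(interpolate(e_tgt, e_opt, eta=0.7), b_ft)
```

In this toy setting the few-step embedding optimization leaves a reconstruction gap that the fine-tuning stage closes, which is the reason the stages are run in this order.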
or make a person give the thumbs up, as demonstrated in Figure 1. Our method, which we call Imagic, provides the first demonstration of text-based semantic editing that applies such sophisticated manipulations to a single real high-resolution image, including editing multiple objects. In addition, Imagic can also perform a wide variety of edits, including style changes, color changes, and object additions.

To achieve this feat, we take advantage of the recent success of text-to-image diffusion models [44, 47, 50]. Diffusion models are powerful state-of-the-art generative models, capable of high quality image synthesis [13, 19]. When conditioned on natural language text prompts, they are able to generate images that align well with the requested text. We adapt them in our work to edit real images instead of synthesizing new ones. We do so in a simple 3-step process, as depicted in Figure 3: We first optimize a text embedding so that it results in images similar to the input image. Then, we fine-tune the pre-trained generative diffusion model (conditioned on the optimized embedding) to better reconstruct the input image. Finally, we linearly interpolate between the target text embedding and the optimized one, resulting in a representation that combines both the input image and the target text. This representation is then passed to the generative diffusion process with the fine-tuned model, which outputs our final edited image.

We conduct several experiments and apply our method on numerous images from various domains. Our method outputs high quality images that both resemble the input image to a high degree, and align well with the target text. These results showcase the generality, versatility, and quality of Imagic. We additionally conduct an ablation study, highlighting the effect of each element of our method. When compared to recent approaches suggested in the literature, Imagic exhibits significantly better editing quality and faithfulness to the original image, especially when tasked with sophisticated non-rigid edits. This is further supported by a human perceptual evaluation study, where raters strongly prefer Imagic over other methods on a novel benchmark called TEdBench – Textual Editing Benchmark.

We summarize our main contributions as follows:
1. We present Imagic, the first text-based semantic image editing technique that allows for complex non-rigid edits on a single real input image, while preserving its overall structure and composition.
2. We demonstrate a semantically meaningful linear interpolation between two text embedding sequences, uncovering strong compositional capabilities of text-to-image diffusion models.
3. We introduce TEdBench – a novel and challenging complex image editing benchmark, which enables comparisons of different text-based image editing methods.

2. Related Work

Following recent advancements in image synthesis quality [23–26], many works utilized the latent space of pre-trained generative adversarial networks (GANs) to perform a variety of image manipulations [3, 16, 33, 40, 53, 54]. Multiple techniques for applying such manipulations on real images were suggested, including optimization-based methods [1, 2, 22], encoder-based methods [4, 45, 61], and methods adjusting the model per input [5, 8, 46]. In addition to GAN-based methods, some techniques utilize other deep learning-based systems for image editing [7, 11].

More recently, diffusion models were utilized for similar image manipulation tasks, showcasing remarkable results. SDEdit [35] adds intermediate noise to an image (possibly augmented by user-provided brush strokes), then denoises it using a diffusion process conditioned on the desired edit, which is limited to global edits. DDIB [59] encodes an input image using DDIM inversion with a source class (or text), and decodes it back conditioned on the target class (or text) to obtain an edited version. DiffusionCLIP [30] utilizes language-vision model gradients, DDIM inversion [56], and model fine-tuning to edit images using a domain-specific diffusion model. It was also suggested to edit images by synthesizing data in user-provided masks, while keeping the
rest of the image intact [6, 36]. Liu et al. [34] guide a diffusion process with a text and an image, synthesizing images similar to the given one, and aligned with the given text. Hertz et al. [17] alter a text-to-image diffusion process by manipulating cross-attention layers, providing more fine-grained control over generated images, and can edit real images in cases where DDIM inversion provides meaningful attention maps. Textual Inversion [14] and DreamBooth [48] synthesize novel views of a given subject given 3–5 images of the subject and a target text (rather than edit a single image), with DreamBooth requiring additional generated images for fine-tuning the models. In this work, we provide the first text-based semantic image editing tool that operates on a single real image, maintains high fidelity to it, and applies non-rigid edits given a single free-form natural language text prompt.

3. Imagic: Diffusion-Based Real Image Editing

3.1. Preliminaries

Diffusion models [19, 55, 57, 63] are a family of generative models that has recently gained traction, as they advanced the state-of-the-art in image generation [13, 28, 58, 62], and have been deployed in various downstream applications such as image restoration [27, 49], adversarial purification [10, 37], image compression [60], image classification [66], and others [12, 15, 29, 41, 52, 64].

The core premise of these models is to initialize with a randomly sampled noise image $x_T \sim \mathcal{N}(0, I)$, then iteratively refine it in a controlled fashion, until it is synthesized into a photorealistic image $x_0$. Each intermediate sample $x_t$ (for $t \in \{0, \dots, T\}$) satisfies

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, \epsilon_t, \qquad (1)$$

with $0 = \alpha_T < \alpha_{T-1} < \cdots < \alpha_1 < \alpha_0 = 1$ being hyperparameters of the diffusion schedule, and $\epsilon_t \sim \mathcal{N}(0, I)$. Each refinement step consists of an application of a neural network $f_\theta(x_t, t)$ on the current sample $x_t$, followed by a random Gaussian noise perturbation, obtaining $x_{t-1}$. The network is trained for a simple denoising objective, aiming for $f_\theta(x_t, t) \approx \epsilon_t$ [19, 55]. This leads to a learned image distribution with high fidelity to the target distribution, enabling stellar generative performance.

This method can be generalized for learning conditional distributions – by conditioning the denoising network on an auxiliary input $y$, the network $f_\theta(x_t, t, y)$ and its resulting diffusion process can faithfully sample from a data distribution conditioned on $y$. The conditioning input $y$ can be a low-resolution version of the desired image [51] or a class label [20]. Furthermore, $y$ can also be a text sequence describing the desired image [44, 47, 50]. By incorporating knowledge from large language models (LLMs) [43] or hybrid vision-language models [42], these text-to-image diffusion models have unlocked a new capability – users can generate realistic high-resolution images using only a text prompt describing the desired scene. In all these methods, a low-resolution image is first synthesized using a generative diffusion process, and then it is transformed into a high-resolution one using additional auxiliary models.

3.2. Our Method

Given an input image $x$ and a target text which describes the desired edit, our goal is to edit the image in a way that satisfies the given text, while preserving a maximal amount of detail from $x$ (e.g., small details in the background and the identity of the object within the image). To achieve this feat, we utilize the text embedding layer of the diffusion model to perform semantic manipulations. Similar to GAN-based approaches [40, 46, 61], we begin by finding a meaningful representation which, when fed through the generative process, yields images similar to the input image. We then fine-tune the generative model to better reconstruct the input image and finally manipulate the latent representation to obtain the edit result.

More formally, as depicted in Figure 3, our method consists of 3 stages: (i) we optimize the text embedding to find one that best matches the given image in the vicinity of the target text embedding; (ii) we fine-tune the diffusion models to better match the given image; and (iii) we linearly interpolate between the optimized embedding and the target text embedding, in order to find a point that achieves both fidelity to the input image and target text alignment. We now turn to describe each step in more detail.

Text embedding optimization. The target text is first passed through a text encoder [43], which outputs its corresponding text embedding $e_{tgt} \in \mathbb{R}^{T \times d}$, where $T$ is the number of tokens in the given target text, and $d$ is the token embedding dimension. We then freeze the parameters of the generative diffusion model $f_\theta$, and optimize the target text embedding $e_{tgt}$ using the denoising diffusion objective [19]:

$$\mathcal{L}(x, e, \theta) = \mathbb{E}_{t,\epsilon}\left[ \left\| \epsilon - f_\theta(x_t, t, e) \right\|_2^2 \right], \qquad (2)$$

where $t \sim \mathrm{Uniform}[1, T]$, $x_t$ is a noisy version of $x$ (the input image) obtained using $\epsilon \sim \mathcal{N}(0, I)$ and Equation 1, and $\theta$ are the pre-trained diffusion model weights. This results in a text embedding that matches our input image as closely as possible. We run this process for relatively few steps, in order to remain close to the initial target text embedding, obtaining $e_{opt}$. This proximity enables meaningful linear interpolation in the embedding space, which does not exhibit linear behavior for distant embeddings.

Model fine-tuning. Note that the obtained optimized embedding $e_{opt}$ does not necessarily lead to the input image $x$ exactly when passed through the generative diffusion process, as our optimization runs for a small number of steps (see top left image in Figure 7). Therefore, in the second stage of our method, we close this gap by optimizing the model parameters $\theta$ using the same loss function presented in Equation 2, while freezing the optimized embedding. This process shifts the model to fit the input image $x$ at the point $e_{opt}$. In parallel, we fine-tune any auxiliary diffusion models present in the underlying generative method, such as super-resolution models. We fine-tune them with the same reconstruction loss, but conditioned on $e_{tgt}$, as $e_{opt}$ is optimized for the base model only. The optimization of these auxiliary models ensures the preservation of high-frequency details from $x$ that are not present in the base resolution.

Text embedding interpolation. Since the generative diffusion model was trained to fully recreate the input image $x$ at the optimized embedding $e_{opt}$, we use it to apply the desired edit by advancing in the direction of the target text embedding $e_{tgt}$. More formally, our third stage is a simple linear interpolation between $e_{tgt}$ and $e_{opt}$. For a given hyperparameter $\eta \in [0, 1]$, we obtain

$$\bar{e} = \eta \cdot e_{tgt} + (1 - \eta) \cdot e_{opt}, \qquad (3)$$

which is the embedding that represents the desired edited image. We then apply the base generative diffusion process using the fine-tuned model, conditioned on $\bar{e}$. This results in a low-resolution edited image, which is then super-resolved using the fine-tuned auxiliary models, conditioned on the target text. This generative process outputs our final high-resolution edited image $\bar{x}$.

3.3. Implementation Details

Our framework is general and can be combined with different generative models. We demonstrate it using two different state-of-the-art text-to-image generative diffusion models: Imagen [50] and Stable Diffusion [47].

Imagen [50] consists of 3 separate text-conditioned diffusion models: (i) a generative diffusion model for 64×64-pixel images; (ii) a super-resolution (SR) diffusion model turning 64×64-pixel images into 256×256 ones; and (iii) another SR model transforming 256×256-pixel images into the 1024×1024 resolution. By cascading these 3 models [20] and using classifier-free guidance [21], Imagen constitutes a powerful text-guided image generation scheme.
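Equations 1 and 2 of Section 3 can be made concrete with a small sketch. The code below noises a toy "image" according to Equation 1 and forms a Monte Carlo estimate of the loss in Equation 2. The linear schedule `alphas`, the sample count, and the two stand-in denoisers are assumptions chosen for illustration only; in practice the denoiser is a neural network and the schedule is a design choice of the underlying model.

```python
import math
import random

def noisy_sample(x0, alpha_t, eps):
    """Equation 1: x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    a, s = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * xi + s * ei for xi, ei in zip(x0, eps)]

def denoising_loss(x, e, f, alphas, n_samples=256, seed=0):
    """Monte Carlo estimate of Equation 2: E_{t,eps} ||eps - f(x_t, t, e)||^2.

    alphas[t] is the schedule value at step t, for t = 0..T.
    f(x_t, t, e) is any callable that returns a noise prediction.
    """
    rng = random.Random(seed)
    T = len(alphas) - 1
    total = 0.0
    for _ in range(n_samples):
        t = rng.randint(1, T)                     # t ~ Uniform[1, T]
        eps = [rng.gauss(0.0, 1.0) for _ in x]    # eps ~ N(0, I)
        x_t = noisy_sample(x, alphas[t], eps)
        pred = f(x_t, t, e)
        total += sum((ei - pi) ** 2 for ei, pi in zip(eps, pred))
    return total / n_samples

# Assumed linear schedule with alpha_0 = 1 and alpha_T = 0.
T = 10
alphas = [1.0 - t / T for t in range(T + 1)]

def zero_denoiser(x_t, t, e):
    """Baseline that always predicts zero noise."""
    return [0.0 for _ in x_t]

def oracle_denoiser(x_t, t, e):
    """Idealized denoiser whose conditioning e is the clean image itself."""
    if alphas[t] >= 1.0:   # no noise was added, nothing to recover
        return [0.0 for _ in x_t]
    a, s = math.sqrt(alphas[t]), math.sqrt(1.0 - alphas[t])
    return [(xi - a * ei) / s for xi, ei in zip(x_t, e)]

x = [0.5, -1.0, 2.0]
loss_zero = denoising_loss(x, None, zero_denoiser, alphas)
loss_oracle = denoising_loss(x, x, oracle_denoiser, alphas)
```

The oracle is deliberately artificial: it shows that the loss vanishes when the conditioning carries everything needed to reconstruct the input, which is the direction in which the embedding optimization and fine-tuning stages push the real model.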
In addition to Imagen, we also implement our method with the publicly available Stable Diffusion model (based

4. Experiments
[Figure 7 panels: input image with target text “A photo of a pistachio cake”.]
Figure 7. Embedding interpolation. Varying η with the same seed, using the pre-trained (top) and fine-tuned (bottom) models.
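The η sweep in Figure 7 amounts to evaluating the linear interpolation of Equation 3 on a grid of η values and sampling one image per interpolated embedding with a fixed seed. A minimal sketch of the embedding side follows. Embeddings are stored as T×d nested lists, one row per token; the toy values, the grid of η, and the function name are assumptions for illustration.

```python
def interpolate_embeddings(e_tgt, e_opt, eta):
    """Equation 3 applied token by token to two T x d embedding sequences."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    if len(e_tgt) != len(e_opt):
        raise ValueError("embedding sequences must have the same length")
    return [
        [eta * t + (1.0 - eta) * o for t, o in zip(row_tgt, row_opt)]
        for row_tgt, row_opt in zip(e_tgt, e_opt)
    ]

# Toy embeddings with T = 2 tokens and d = 3 dimensions (made-up numbers).
e_tgt = [[1.0, 0.0, 2.0], [0.0, 4.0, 0.0]]
e_opt = [[0.0, 2.0, 2.0], [2.0, 0.0, 1.0]]

# One interpolated embedding per eta; each would condition one sampling run.
etas = [0.0, 0.25, 0.5, 0.75, 1.0]
sweep = {eta: interpolate_embeddings(e_tgt, e_opt, eta) for eta in etas}
```

At η = 0 the sweep returns the optimized embedding unchanged and at η = 1 it returns the target text embedding, so the two ends of each row in the figure correspond to pure reconstruction and pure text conditioning.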
[Figure 8 plot: three panels (Imagic vs. SDEdit, Imagic vs. DDIB, Imagic vs. Text2LIVE), vertical axis from 0% to 100%.]
Figure 8. User study results. Preference rates (with 95% confidence intervals) for image editing quality of Imagic over SDEdit [35], DDIB [59], and Text2LIVE [7].

[Figure 9 plot: Text Alignment and Image Fidelity curves plotted against η.]
Figure 9. Editability–fidelity tradeoff. CLIP score (target text alignment) and 1−LPIPS (input image fidelity) as functions of η, averaged over 150 inputs. Edited images tend to match both the input image and text in the highlighted area.

posed technique with the same Imagen [50] model and target text prompt that we use. We keep the diffusion hyperparameters from Imagen, and choose the intermediate diffusion timestep for SDEdit independently for each image to achieve the best target text alignment without drastically changing the image contents. For DDIB, we provide an additional source text.

Figure 6 shows editing results of different methods. For SDEdit and Imagic, we sample 8 images using different random seeds and display the result with the best alignment to both the target text and the input image. As can be observed, our method maintains high fidelity to the input image while aptly performing the desired edits. When tasked with a complex non-rigid edit such as making a dog sit, our method significantly outperforms previous techniques. Imagic constitutes the first demonstration of such sophisticated text-based edits applied on a single real-world image. We verify this claim through a user study in subsection 4.3.

4.3. TEdBench and User Study

Text-based image editing methods are a relatively recent development, and Imagic is the first to apply complex non-rigid edits. As such, no standard benchmark exists for evaluating non-rigid text-based image editing. We introduce TEdBench (Textual Editing Benchmark), a novel collection of 100 pairs of input images and target texts describing a desired complex non-rigid edit. We hope that future research will benefit from TEdBench as a standardized evaluation set for this task.

We quantitatively evaluate Imagic’s performance via an extensive human perceptual evaluation study on TEdBench, performed using Amazon Mechanical Turk. Participants were shown an input image and a target text, and were asked to choose the better editing result from one of two options, using the standard practice of Two-Alternative Forced Choice (2AFC) [7, 32, 39]. The options to choose from were our result and a baseline result from one of: SDEdit [35], DDIB [59], or Text2LIVE [7]. In total, we collected 10346 answers, whose results are summarized in Figure 8. As can be seen, evaluators exhibit a strong preference towards our method, with a preference rate of more than 65% across all considered baselines. See the appendix for more details about the user study and method implementations.

4.4. Ablation Study

Fine-tuning and optimization. We generate edited images for different η values using the pre-trained 64×64 diffusion model and our fine-tuned one, in order to gauge the effect of fine-tuning on the output quality. We use the same optimized embedding and random seed, and qualitatively evaluate the results in Figure 7. Without fine-tuning, the scheme does not fully reconstruct the original image at $\eta = 0$, and fails to retain the image’s details as η increases. In contrast, fine-tuning imposes details from the input image beyond just the optimized embedding, allowing our scheme to retain these details for intermediate values of η, thereby enabling semantically meaningful linear interpolation. Thus, we conclude that model fine-tuning is essential for our method’s success. Furthermore, we experiment with the number of text embedding optimization steps in the appendix. Our findings suggest that optimizing the text embedding with a smaller number of steps limits our editing capabilities, while optimizing for more than 100 steps yields little to no added value.
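Preference rates with confidence intervals, such as those reported for the user study in Figure 8, can be computed from raw 2AFC answer counts. The sketch below uses the normal-approximation (Wald) interval, which is an assumption: the text does not state which interval the authors used. The counts are hypothetical and do not come from the study.

```python
import math

def preference_rate_ci(wins, total, z=1.96):
    """Preference rate and normal-approximation confidence interval.

    wins  : number of answers preferring the method under test
    total : number of answers collected for this comparison
    z     : 1.96 gives roughly a 95% interval
    """
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("need 0 <= wins <= total and total > 0")
    p = wins / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical counts for one baseline comparison (not data from the paper).
rate, low, high = preference_rate_ci(wins=700, total=1000)
```

With a few thousand answers per comparison the interval is narrow, so a preference rate above 65% sits well clear of the 50% level that would indicate no preference.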
                                                                               7
                                                                           Input Image   Edited Image     Input Image     Edited Image
Interpolation intensity As can be observed in Figure 7,
fine-tuning increases the η value at which the model strays
from reconstructing the input image. While the optimal η
value may vary per input (as different edits require differ-
ent intensities), we attempt to identify the region in which
                                                                       Target Text:      “A photo of a                  “A dog lying down”
the edit is best applied. To that end, we apply our editing                               traffic jam”
scheme with different η values, and calculate the outputs’
CLIP score [18, 42] w.r.t. the target text, and their LPIPS
score [65] w.r.t. the input image subtracted from 1. A higher
CLIP score indicates better output alignment with the target
text, and a higher 1´LPIPS indicates higher fidelity to the            Target Text:       “A photo of                      “Pizza with
input image. We repeat this process for 150 image-text in-                                a race car”                      pepperoni”
puts, and show the average results in Figure 9. We observe             Figure 10. Failure cases. Insufficient consistency with the target
that for η values smaller than 0.4, outputs are almost identi-         text (top), or changes in camera viewing angle (bottom).
cal to the input images. For η P r0.6, 0.8s, the images begin
to change (according to LPIPS), and align better with the
                                                                       5. Conclusions and Future Work
text (as the CLIP score rises). Therefore, we identify this
area as the most probable for obtaining satisfactory results.              We propose a novel image editing method called Imagic.
Note that while they provide a good sense of text or image             Our method accepts a single image and a simple text prompt
alignment on average, CLIP score and LPIPS are imprecise               describing the desired edit, and aims to apply this edit while
measures that rely on neural network backbones, and their              preserving a maximal amount of details from the image.
values noticeably differ for each different input image-text           To that end, we utilize a pre-trained text-to-image diffusion
pair. As such, they are not suited for reliably choosing η             model and use it to find a text embedding that represents
for each input in an automatic way, nor can they faithfully            the input image. Then, we fine-tune the diffusion model to
assess an editing method’s performance.                                fit the image better, and finally we linearly interpolate be-
tween the embedding representing the image and the target text embedding, obtaining a semantically meaningful mixture of them. This enables our scheme to provide edited images using the interpolated embedding. Contrary to other editing methods, our approach can produce sophisticated non-rigid edits that may alter the pose, geometry, and/or composition of objects within the image as requested, in addition to simpler edits such as style or color. It requires the user to provide only a single image and a simple target text prompt, without the need for additional auxiliary inputs such as image masks.
    Our future work may focus on further improving the method’s fidelity to the input image and identity preservation, as well as its sensitivity to random seeds and to the interpolation parameter η. Another intriguing research direction would be the development of an automated method for choosing η for each requested edit.
Societal Impact Our method aims to enable complex editing of real world images using textual descriptions of the target edit. As such, it is prone to societal biases of the underlying text-based generative models, albeit to a lesser extent than purely generative methods since we rely mostly on the input image for editing. However, as with other approaches that use generative models for image editing, such techniques might be used by malicious parties for synthesizing fake imagery to mislead viewers. To mitigate this, further research on the identification of synthetically edited or generated content is needed.

4.5. Limitations

    We identify two main failure cases of our method: In some cases, the desired edit is applied very subtly (if at all), therefore not aligning well with the target text. In other cases, the edit is applied well, but it affects extrinsic image details such as zoom or camera angle. We show examples of these two failure cases in the first and second row of Figure 10, respectively. When the edit is not applied strongly enough, increasing η usually achieves the desired result, but it sometimes leads to a significant loss of original image details (for all tested random seeds) in a handful of cases. As for zoom and camera angle changes, these usually occur before the desired edit takes place, as we progress from a low η value to a large one, which makes circumventing them difficult. We demonstrate this in the appendix, and include additional failure cases in TEdBench as well.
    These limitations can possibly be mitigated by optimizing the text embedding or the diffusion model differently, or by incorporating cross-attention control akin to Hertz et al. [17]. We leave those options for future work. Also, since our method relies on a pre-trained text-to-image diffusion model, it inherits the model’s generative limitations and biases. Therefore, unwanted artifacts are produced when the desired edit involves generating failure cases of the underlying model. For instance, Imagen is known to show substandard generative performance on human faces [50].
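
To make the role of the interpolation parameter η concrete, the following is a minimal sketch of sweeping η between two embeddings. It assumes the linear form ē = η · e_tgt + (1 − η) · e_opt, so that η = 0 corresponds to the embedding that reconstructs the input image and η = 1 to the target text embedding. The function names `interpolate` and `eta_sweep` and the toy three-dimensional vectors are illustrative assumptions, not part of any released implementation.

```python
from typing import List


def interpolate(e_tgt: List[float], e_opt: List[float], eta: float) -> List[float]:
    """Linear mix: eta * e_tgt + (1 - eta) * e_opt, element by element."""
    if len(e_tgt) != len(e_opt):
        raise ValueError("embeddings must have the same length")
    return [eta * t + (1.0 - eta) * o for t, o in zip(e_tgt, e_opt)]


def eta_sweep(steps: int = 11) -> List[float]:
    """Evenly spaced eta values from 0 to 1 inclusive (0.0, 0.1, ..., 1.0 by default)."""
    return [i / (steps - 1) for i in range(steps)]


# Toy 3-dimensional embeddings (hypothetical values, not real model outputs).
e_opt = [1.0, 0.0, 2.0]  # stands in for the embedding optimized to reconstruct the input image
e_tgt = [0.0, 1.0, 2.0]  # stands in for the target text embedding

for eta in eta_sweep():
    e_bar = interpolate(e_tgt, e_opt, eta)
    # In a real pipeline, e_bar would condition the fine-tuned diffusion model here.
    print(f"eta={eta:.1f} -> {e_bar}")
```

In such a sweep, an edit that is applied too weakly would be addressed by moving to a larger η, at the cost of drifting further from the embedding that reconstructs the input image.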
Acknowledgements

   This work was done during an internship at Google Research. We thank William Chan, Chitwan Saharia, and Mohammad Norouzi for providing us with their support and access to the Imagen source code and pre-trained models. We also thank Michael Rubinstein and Nataniel Ruiz for insightful discussions during the development of this work.

References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE international conference on computer vision, pages 4432–4441, 2019. 3
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296–8305, 2020. 3
[3] Rameen Abdal, Peihao Zhu, Niloy Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows, 2020. 3
[4] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021. 3
[5] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. arXiv preprint arXiv:2111.15666, 2021. 3
[6] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022. 2, 4
[7] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: text-driven layered image and video editing. arXiv preprint arXiv:2204.02491, 2022. 2, 3, 6, 7, 15
[8] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020. 3
[9] Amit H Bermano, Rinon Gal, Yuval Alaluf, Ron Mokady, Yotam Nitzan, Omer Tov, Oren Patashnik, and Daniel Cohen-Or. State-of-the-art in the architecture, methods and applications of stylegan. In Computer Graphics Forum, volume 41, pages 591–611. Wiley Online Library, 2022. 2
[10] Tsachi Blau, Roy Ganz, Bahjat Kawar, Alex Bronstein, and Michael Elad. Threat model-agnostic adversarial defense using diffusion models. arXiv preprint arXiv:2207.08089, 2022. 4
[11] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022. 3
[12] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022. 4
[13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021. 3, 4
[14] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022. 2, 4
[15] Jin Gao, Jialing Zhang, Xihui Liu, Trevor Darrell, Evan Shelhamer, and Dequan Wang. Back to the source: Diffusion-driven test-time adaptation. arXiv preprint arXiv:2207.03442, 2022. 4
[16] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. arXiv preprint arXiv:2004.02546, 2020. 3
[17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022. 2, 4, 8
[18] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021. 8
[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3, 4, 5
[20] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022. 4, 5
[21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 5
[22] Ali Jahanian, Lucy Chai, and Phillip Isola. On the “steerability” of generative adversarial networks. In International Conference on Learning Representations, 2019. 3
[23] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020. 3
[24] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34, 2021. 3
[25] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410, 2019. 3
[26] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020. 3
[27] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022. 4
[28] Bahjat Kawar, Roy Ganz, and Michael Elad. Enhancing diffusion-based image synthesis with robust classifier guidance. arXiv preprint arXiv:2208.08664, 2022. 4
[29] Bahjat Kawar, Jiaming Song, Stefano Ermon, and Michael Elad. JPEG artifact correction using denoising diffusion restoration models. arXiv preprint arXiv:2209.11888, 2022. 4
[30] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022. 2, 3
[31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. 6
[32] Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Style transfer by relaxed optimal transport and self-similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10051–10060, 2019. 7, 15
[33] Oran Lang, Yossi Gandelsman, Michal Yarom, Yoav Wald, Gal Elidan, Avinatan Hassidim, William T. Freeman, Phillip Isola, Amir Globerson, Michal Irani, and Inbar Mosseri. Explaining in style: Training a gan to explain a classifier in stylespace. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 693–702, 2021. 3
[34] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. arXiv preprint arXiv:2112.05744, 2021. 4
[35] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021. 3, 6, 7, 15
[36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 2, 4
[37] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. In International Conference on Machine Learning (ICML), 2022. 4
[38] Byong Mok Oh, Max Chen, Julie Dorsey, and Frédo Durand. Image-based modeling and photo editing. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 433–442, 2001. 1
[39] Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli Shechtman, Alexei Efros, and Richard Zhang. Swapping autoencoder for deep image manipulation. Advances in Neural Information Processing Systems, 33:7198–7211, 2020. 7, 15
[40] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021. 2, 3, 4
[41] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR, 2021. 4
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 4, 8
[43] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020. 4, 5, 13
[44] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. 2, 3, 4
[45] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020. 3
[46] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 42(1):1–13, 2022. 3, 4
[47] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 3, 4, 5, 6
[48] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022. 2, 4
[49] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 4
[50] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022. 3, 4, 5, 6, 7, 8, 13
[51] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 4
[52] Hiroshi Sasaki, Chris G Willcocks, and Toby P Breckon. Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358, 2021. 4
[53] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243–9252, 2020. 3
[54] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in gans. arXiv preprint arXiv:2007.06600, 2020. 3
[55] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. 4
[56] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020. 3, 6
[57] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. 4
[58] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020. 4
[59] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. arXiv preprint arXiv:2203.08382, 2022. 3, 6, 7, 15
[60] Lucas Theis, Tim Salimans, Matthew D Hoffman, and Fabian Mentzer. Lossy compression with Gaussian diffusion. arXiv preprint arXiv:2206.08889, 2022. 4
[61] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. arXiv preprint arXiv:2102.02766, 2021. 3, 4
[62] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021. 4
[63] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011. 4
[64] Julia Wolleb, Robin Sandkühler, Florentin Bieder, Philippe Valmaggia, and Philippe C Cattin. Diffusion models for implicit image segmentation ensembles. arXiv preprint arXiv:2112.03145, 2021. 4
[65] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 8
[66] Roland S Zimmermann, Lukas Schott, Yang Song, Benjamin A Dunn, and David A Klindt. Score-based generative classifiers. arXiv preprint arXiv:2110.00473, 2021. 4
A. Additional Results
[Figure 11 image grid: pairs of input image (left) and edited image (right), with target texts: “A kitten lying down”, “A horse raising its head”, “A photo of a sitting dog”, “A photo of a sitting giraffe”, “A bird eating a fish”, “A photo of an open box”, “A photo of a cat yawning”, “A dog wearing a hat”, “Two pelicans kissing”, “Two dogs growling at each other”, “Two parrots looking down”, “Two eggs in a nest”, “Two bananas”, “Two cookies next to a glass of juice”, “A photo of a birthday cake”, “A waffle with whipped cream”, “A photo of a tree in snow”, “A photo of a vase of colorful tulips”, “A photo of a beach at night”, “A drawing of a watermelon”, “A photo of a blue car”, “A photo of a red chair”, “A photo of a yellow shirt”, “A painted Easter egg”, “Two champagne glasses on the windowsill”, “A photo of a guitar”, “A horse jumping out of the water”, “A soccer ball in the sand”.]
Figure 11. Wide range of editing types. Additional 1024 × 1024-pixel pairs of original (left) and edited (right) images using our method (with target texts). Editing types include posture changes, composition changes, multiple object editing, object additions, object replacements, style changes, and color changes.
B. Ablation Study
   In the paper, we performed ablation studies on model fine-tuning and interpolation intensity. Here we present a discussion on the necessity of text embedding optimization, and additional ablation studies on the number of text embedding optimization steps and our method’s sensitivity to varying random seeds.
Text embedding optimization Our method consists of three main stages: text embedding optimization, model fine-tuning,
and interpolation. In the paper, we tested the value that the latter two stages add to our method. For the final two stages to
work well, the first one needs to provide two text embeddings to interpolate between: a “target” embedding and a “source”
embedding. Naturally, one might be inclined to ask the user for both a target text describing the desired edit, and a source text
describing the input image, which could theoretically replace the text embedding optimization stage. However, besides the
additional required user input, this option may be rendered impractical, depending on the architecture of the text embedding
model. For instance, Imagen [50] uses the T5 language model [43]. This model outputs a text embedding whose length
depends on the number of tokens in the text, whereas interpolation requires the two embeddings to have the same length.
It is highly impractical to ask the user for two texts with equal token counts, especially since sentences may have a different
number of tokens even if they have the same number of words (depending on the tokenizer used). Therefore, we opt not to test this option, and
defer the pursuit of cleverer alternatives to future work.
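
The length mismatch can be illustrated with a toy example. The sketch below is not tied to T5 or any real tokenizer: `toy_embed`, the subword splits, and the numeric values are all hypothetical stand-ins, chosen only to show that two sentences with the same number of words can yield embeddings of different lengths, while an embedding optimized from a copy of the target embedding keeps the target's shape.

```python
import copy
from typing import List

Embedding = List[List[float]]  # one vector per token


def toy_embed(tokens: List[str], dim: int = 4) -> Embedding:
    """Hypothetical stand-in for a text encoder: one dim-sized vector per token."""
    return [[float(len(tok) + j) for j in range(dim)] for tok in tokens]


def same_shape(a: Embedding, b: Embedding) -> bool:
    """Element-wise interpolation needs equal token counts and equal vector sizes."""
    return len(a) == len(b) and all(len(x) == len(y) for x, y in zip(a, b))


# Hypothetical subword splits: both sentences have six words, but different token counts.
source_tokens = ["A", "photo", "of", "a", "stand", "ing", "dog"]  # 7 tokens
target_tokens = ["A", "photo", "of", "a", "sitting", "dog"]       # 6 tokens

e_src = toy_embed(source_tokens)
e_tgt = toy_embed(target_tokens)
print(same_shape(e_src, e_tgt))  # False: cannot be interpolated element-wise

# Starting the optimization from a copy of the target embedding keeps its shape.
e_opt = copy.deepcopy(e_tgt)
for row in e_opt:
    for j in range(len(row)):
        row[j] -= 0.1  # placeholder for a gradient step on a reconstruction loss
print(same_shape(e_opt, e_tgt))  # True: interpolation is well defined
```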
Number of text embedding optimization steps We evaluate the effect of the number of text embedding optimization steps on our editing results, both with and without model fine-tuning. We optimize the text embedding for 10, 100, and 1000 steps, then fine-tune the 64 × 64 diffusion model for 1500 steps separately on each optimized embedding. We fix the same random seed and assess the editing results for η ranging from 0 to 1. From the visual results in Figure 13, we observe that the embedding obtained after a 10-step optimization remains very close to the initial target text embedding, thereby retaining the same semantics in the pre-trained model, and imposing the reconstruction of the input image on the entire interpolation range in the fine-tuned model. Conversely, optimizing for 100 steps leads to an embedding that captures the basic essence of the input image, allowing for meaningful interpolation. However, the embedding does not completely recover the image, and thus the interpolation fails to apply the requested edit in the pre-trained model. Fine-tuning the model leads to an improved image reconstruction at η = 0, and enables the intermediate η values to match both the target text and the input image. Optimizing for 1000 steps enhances the pre-trained model performance slightly, but offers no discernible improvement after fine-tuning, sometimes even degrading it, in addition to incurring an added runtime cost. Therefore, we opt to apply our method using 100 text embedding optimization steps and 1500 model fine-tuning steps for all examples shown in the paper.
[Figure 13 image grid. Target text: “A photo of a dog lying down”]
Figure 13. Ablation for number of embedding optimization steps. Editing results for varying η and number of text embedding optimization steps, with and without fine-tuning (fixed seed).
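
As a rough illustration, the ablation grid described in this study can be enumerated as a small configuration sketch. The step counts (10, 100, 1000 and 1500) come from the text; the class name `AblationRun`, the use of 11 evenly spaced η values, and the seed value 0 are assumptions made only for this example.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class AblationRun:
    embed_steps: int     # text embedding optimization steps
    finetune_steps: int  # 0 means the pre-trained model is used as-is
    eta: float           # interpolation parameter
    seed: int            # fixed across the whole grid


def build_grid(eta_points: int = 11, seed: int = 0) -> List[AblationRun]:
    """Enumerate every (embedding steps, fine-tuning, eta) combination."""
    etas = [i / (eta_points - 1) for i in range(eta_points)]  # assumed spacing
    return [
        AblationRun(embed_steps, finetune_steps, eta, seed)
        for embed_steps, finetune_steps, eta in product((10, 100, 1000), (0, 1500), etas)
    ]


grid = build_grid()
print(len(grid))  # 3 step counts x 2 fine-tuning settings x 11 eta values = 66

# The setting the text reports using for its examples: 100 embedding steps, 1500 fine-tuning steps.
chosen = [run for run in grid if run.embed_steps == 100 and run.finetune_steps == 1500]
print(len(chosen))  # 11, one per eta value
```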
Different seeds Since our method utilizes a probabilistic generative model, different random seeds yield different results for the same input, as demonstrated in Figure 4. In Figure 14, we assess the effect of varying η values for different random seeds on the same input. We notice that different seeds produce viable edited images at different η thresholds, and the resulting edits differ as well. For example, the first tested seed in Figure 14 first shows an edit at η = 0.8, whereas the second one does so at η = 0.7. As for the third one, the image undergoes a significant unwanted change (the dog looks to the right instead of left) at a lower η than the one at which the edit is applied (the dog jumps). For some image-text inputs, we see behavior similar to the third seed in all of the 5 random seeds that we test. We consider these as failure cases and show some of them in Figure 10. Different target text prompts with similar meaning may circumvent these issues, since our optimization process is initialized with the target text embedding. We do not explore this option as it would compromise the intuitiveness of our method.
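
The per-seed behavior described above can be summarized with a small bookkeeping sketch. It assumes two hypothetical predicates per seed, `edit_applied` and `unwanted_change`, which in practice would come from human inspection or some classifier; neither is part of the method. The thresholds for the first two seeds follow the values quoted in the text, while the numbers for the third seed are invented for illustration.

```python
from typing import Callable, List, Optional, Tuple


def first_eta(etas: List[float], happened: Callable[[float], bool]) -> Optional[float]:
    """Smallest eta in the sweep for which the predicate holds, or None."""
    for eta in sorted(etas):
        if happened(eta):
            return eta
    return None


def classify_seed(
    etas: List[float],
    edit_applied: Callable[[float], bool],
    unwanted_change: Callable[[float], bool],
) -> Tuple[str, Optional[float]]:
    """Label a seed as a failure if an unwanted change appears before the edit does."""
    edit_at = first_eta(etas, edit_applied)
    drift_at = first_eta(etas, unwanted_change)
    if edit_at is None:
        return ("no edit", None)
    if drift_at is not None and drift_at < edit_at:
        return ("failure", edit_at)
    return ("ok", edit_at)


etas = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

results = {
    "seed 1": classify_seed(etas, lambda e: e >= 0.8, lambda e: False),
    "seed 2": classify_seed(etas, lambda e: e >= 0.7, lambda e: False),
    # Hypothetical numbers: the unwanted change shows up at 0.5, the edit only at 0.8.
    "seed 3": classify_seed(etas, lambda e: e >= 0.8, lambda e: e >= 0.5),
}
for name, outcome in results.items():
    print(name, outcome)
```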
[Figure 14 image grid. Target text: “A photo of a jumping dog”. Columns: η from 0.000 to 1.000 in steps of 0.100.]
Figure 14. Different seeds. Varying η values and different seeds produce different results for the same input.
Figure 15. User study screenshot. An example screenshot of a question shown to participants in our human perceptual evaluation study.