MaskDiffusion: Exploiting Pre-Trained Diffusion Models for Semantic Segmentation
1 Introduction
In this work, we leverage the internal feature space as well as the attention maps of a pre-trained Stable Diffusion model for semantic segmentation.
Diffusion models [14, 28, 30] trained on large-scale datasets have revolution-
ized the field of image generation, which can be largely attributed to conditioning
with text embeddings derived from large-scale pre-trained visual-language mod-
els such as CLIP [27]. We hypothesize that these diffusion models have mastered
a wide variety of open vocabulary concepts that could be used for dense predic-
tion tasks, particularly for semantic segmentation. In previous works [22,41], the
internal features of a frozen diffusion model have already been used for semantic
segmentation after some additional training. Based on these observations, we
take a closer look at the internal features of diffusion models for semantic seg-
mentation and introduce MaskDiffusion, which achieves effective segmentation
in the wild without additional training.
The strengths of our MaskDiffusion are manifold. First, it eliminates the
need for pixel-by-pixel annotations typically required by popular semantic seg-
mentation techniques. Second, it has the ability to segment any class of objects,
distinguishing it from traditional diffusion-based methods [2, 3, 41].
As shown in Figure 1, applying k-means clustering on the internal features
of the UNet in Stable Diffusion already provides a rough yet consistent segmentation, whereas MaskCLIP [44] fails to uniformly segment the image. Our main contributions are summarized as follows:
1. We analyze the internal features of diffusion models and show that they are useful for semantic segmentation. In terms of unsupervised mIoU, k-means clustering of the internal features is comparable to conventional unsupervised segmentation methods.
2. We introduce MaskDiffusion and Unsupervised MaskDiffusion, which achieve compelling segmentation results for all categories in the wild without any additional training.
3. Our MaskDiffusion outperforms GEM [38] by 10.5 mIoU on the Potsdam dataset [1], demonstrating the superior segmentation performance of our proposed approach. Moreover, Unsupervised MaskDiffusion surpasses DiffSeg [35] by 14.8 mIoU on the COCO-Stuff dataset [5], as measured by unsupervised mIoU.
2 Related Work
Fig. 2: Overview of MaskDiffusion. The input image is encoded by the VAE and passed through the text-to-image diffusion UNet together with the CLIP-encoded class names (e.g., cat, mirror, background). The resulting internal features (1024 × H × W) and cross-attention maps are post-processed into the final segmentation.
3 Method
Fig. 3: Post-processing in MaskDiffusion. A weighted mean of the internal features (1024 × H × W) with the cross-attention maps yields a representative feature per class (cat, mirror, background); the cosine similarity between each pixel's internal feature and the representative features produces the semantic segmentation.
Stable Diffusion [28] is trained to predict the noise added to the latent image by minimizing

$$L_{LDM} := \mathbb{E}_{\varepsilon(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t} \left[ \| \epsilon - \epsilon_{\theta}(z_t, t, \tau_\theta(y)) \|^2_2 \right], \qquad (1)$$
where x is the input image, ε(x) is its latent representation produced by the VAE encoder, z_t is the noised latent at time step t, ϵ is the noise sampled from a normal distribution N(0, 1), y is the text prompt, τ_θ is a pre-trained text encoder, and ϵ_θ(z_t, t, τ_θ(y)) is the noise predicted by the UNet.
A significant aspect of Stable Diffusion [28] lies in the utilization of a trained
CLIP [27] as the text encoder, facilitating the conversion of textual inputs into
corresponding vectors. Embedding these vectors into the UNet architecture en-
ables the integration of text-based conditioning during the denoising process,
achieved through cross-attention mechanisms. Attention consists of three vec-
tors: query (Q), key (K), and value (V ), and is defined as
$$Q = W^{(i)}_Q \cdot \phi_i(z_t), \quad K = W^{(i)}_K \cdot \tau_{\theta}(y), \quad V = W^{(i)}_V \cdot \tau_{\theta}(y),$$
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V, \qquad (2)$$
where d is the scaling factor, τ_θ is the text encoder, y is the text prompt, N is the number of tokens in the input sequence, φ_i(z_t) ∈ R^{N×d_i} is a (flattened) intermediate representation of the UNet implementing ϵ_θ, and W^{(i)}_V ∈ R^{d×d_i}, W^{(i)}_Q ∈ R^{d×d_τ}, W^{(i)}_K ∈ R^{d×d_τ} are projection matrices. The query is a vector created from the input image, while the key and value are created from the text vector.
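To make the role of the cross-attention maps concrete, the following PyTorch sketch evaluates Eq. (2) for a flattened UNet feature map and a set of token embeddings. The tensor shapes and projection matrices are hypothetical stand-ins for illustration; in practice, the maps are read out of the pre-trained Stable Diffusion UNet.

```python
# Minimal sketch of the cross-attention in Eq. (2), using plain PyTorch.
# W_q, W_k and all shapes are hypothetical stand-ins, not the actual
# Stable Diffusion weights.
import torch

def cross_attention_maps(phi, text_emb, W_q, W_k, d):
    """phi: (HW, d_i) flattened UNet features; text_emb: (N, d_tau) token embeddings.
    Returns softmax(Q K^T / sqrt(d)) of shape (HW, N): one attention map per token."""
    Q = phi @ W_q.T        # (HW, d)
    K = text_emb @ W_k.T   # (N, d)
    return torch.softmax(Q @ K.T / d ** 0.5, dim=-1)  # column c is the map a_c

# Toy usage with random tensors (HW = 64*64 pixels, N = 3 tokens).
phi = torch.randn(64 * 64, 320)
text_emb = torch.randn(3, 768)
W_q, W_k = torch.randn(320, 320), torch.randn(320, 768)
maps = cross_attention_maps(phi, text_emb, W_q, W_k, d=320)  # (4096, 3)
```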
3.2 MaskDiffusion
Figure 2 shows our proposed method, MaskDiffusion, a novel approach that leverages a pre-trained diffusion model as a semantic segmentation model, eliminating the need for additional training. MaskDiffusion capitalizes
on the observation that the intrinsic features derived from the internal layer
of the UNet in Stable Diffusion [28] inherently contain semantic information,
as demonstrated in the experimental results outlined in Section 4.3. For each
pixel of the input image we obtain an internal feature, denoted as f ∈ R^{1024}, by upsampling and combining the outputs of the internal layers as follows,
$$\mathbf{f}_i = \text{UNetLayer}_i(x_i, \tau_{\theta}(s)), \qquad \mathbf{f} = \text{Concatenate}(\mathbf{f}_1, \ldots, \mathbf{f}_n), \qquad (3)$$
where xi denotes the input to the internal UNet layer, s represents the prompt
and n is the number of internal UNet layers. Note that x_0 is the input image and x_1 and beyond are the outputs of the previous layers.
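A minimal sketch of Eq. (3) is given below: intermediate outputs are upsampled to a common resolution and concatenated along the channel dimension. The layer outputs and channel sizes here are illustrative assumptions; in the actual pipeline they would be collected from the pre-trained UNet, for instance via forward hooks.

```python
# Sketch of Eq. (3): upsample intermediate UNet outputs to a common resolution
# and concatenate them channel-wise into the internal feature f.
# The layer outputs below are random placeholders for illustration.
import torch
import torch.nn.functional as F

def build_internal_feature(layer_outputs, size=(64, 64)):
    """layer_outputs: list of (1, C_i, h_i, w_i) tensors from n UNet layers.
    Returns f of shape (1, sum(C_i), H, W)."""
    upsampled = [F.interpolate(o, size=size, mode="bilinear", align_corners=False)
                 for o in layer_outputs]
    return torch.cat(upsampled, dim=1)

# Hypothetical feature maps at 8x8, 16x16 and 32x32 resolution (512+256+256 = 1024 channels).
outs = [torch.randn(1, 512, 8, 8), torch.randn(1, 256, 16, 16), torch.randn(1, 256, 32, 32)]
f = build_internal_feature(outs)  # (1, 1024, 64, 64): a 1024-dim feature per pixel
```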
Due to its high dimensionality, determining the class of each pixel solely
based on f proves to be a challenging task. Therefore, we propose to incorporate
the information contained in the cross-attention maps in the UNet architecture
into f in order to assign a class to each pixel.
To this end, we post-process the internal features with the cross-attention maps. Let a_{cnm} denote the value of the cross-attention map of class c at pixel position (n, m). We then compute a representative feature f̄_c for each class and assign every pixel to its most similar class:
$$\bar{\mathbf{f}}_{c} = \frac{1}{A_c} \sum_{n,m} a_{cnm} \cdot \mathbf{f}_{nm}, \qquad A_c = \sum_{n,m} a_{cnm}, \qquad (5)$$
$$s_{nm} = \underset{c}{\text{argmax}}\; \frac{\mathbf{f}_{nm}^T \cdot \bar{\mathbf{f}}_c}{\| \mathbf{f}_{nm} \| \cdot \| \bar{\mathbf{f}}_c \|}, \qquad (6)$$
Equation (6) can be interpreted as assigning to each pixel the category, represented by f̄_c, that exhibits the closest resemblance to the internal feature f_nm of the corresponding pixel.
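A minimal sketch of the post-processing in Eqs. (5) and (6) is given below; the tensor shapes and the toy inputs are illustrative assumptions.

```python
# Sketch of Eqs. (5)-(6): attention-weighted mean of the internal features per
# class, followed by a cosine-similarity argmax for every pixel.
import torch
import torch.nn.functional as F

def post_process(f, attn):
    """f: (D, H, W) internal features; attn: (C, H, W) cross-attention maps a_c.
    Returns s: (H, W) class index per pixel."""
    D, H, W = f.shape
    feats = f.reshape(D, H * W)                           # (D, HW)
    a = attn.reshape(attn.shape[0], H * W)                # (C, HW)
    f_bar = (a @ feats.T) / a.sum(dim=1, keepdim=True)    # Eq. (5): (C, D)
    sim = F.normalize(f_bar, dim=1) @ F.normalize(feats, dim=0)  # cosine similarity, (C, HW)
    return sim.argmax(dim=0).reshape(H, W)                # Eq. (6)

# Toy usage: 1024-dim features on a 64x64 grid and 3 classes.
s = post_process(torch.randn(1024, 64, 64), torch.rand(3, 64, 64))
```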
3.3 Unsupervised MaskDiffusion

We further propose Unsupervised MaskDiffusion, which extends MaskDiffusion to unsupervised segmentation. As such, this method does not require any prompts and only relies on images as input, which corresponds to the same setting as DiffSeg [35].
An overview of Unsupervised MaskDiffusion is shown in Figure 4. In order to cluster
the obtained internal features in the text-free setting, we explore two different
clustering methods, namely k-means and spectral clustering [25]. In particular,
we employ spectral clustering [25] to include the spatial relationships of the
internal features in the segmentation process. Based on the observation that internal features representing the same class should be similar to each other, we compute the similarity map h ∈ R^{HW×HW} between the reshaped internal features f̂ ∈ R^{HW×1024} as
$$\mathbf{h}_{ij} = \frac{\hat{\mathbf{f}}_{i}^T \cdot \hat{\mathbf{f}}_{j}}{\| \hat{\mathbf{f}}_{i} \| \cdot \| \hat{\mathbf{f}}_{j} \|}, \qquad (7)$$
where i and j denote the pixel location. A large value of hij in this map indicates
that the pixels at i and j very likely belong to the same class. Note that we set
H and W to 100 and accordingly resize f to 100 × 100 to reduce computational
complexity.
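A sketch of the similarity map of Eq. (7), including the resize to 100 × 100, is shown below; the feature dimensions are illustrative.

```python
# Sketch of Eq. (7): pairwise cosine similarity between the internal features
# after resizing them, so that the HW x HW map stays tractable.
import torch
import torch.nn.functional as F

def similarity_map(f, size=100):
    """f: (D, H, W) internal features. Returns h of shape (size*size, size*size)."""
    f_small = F.interpolate(f.unsqueeze(0), size=(size, size), mode="bilinear",
                            align_corners=False)[0]           # (D, size, size)
    flat = F.normalize(f_small.reshape(f.shape[0], -1), dim=0).T  # (size*size, D), unit rows
    return flat @ flat.T                                      # h_ij = cosine similarity

h = similarity_map(torch.randn(1024, 64, 64))  # (10000, 10000)
```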
Next, we derive a Laplacian matrix from the similarity map and obtain clusters from the eigenvectors corresponding to its smallest eigenvalues. The clustering results are then post-processed using f, where the cluster assignments replace the cross-attention map in Equation 5 to produce the final segmentation image. Note that the cluster assignments obtained from spectral clustering [25] are either 0 or 1, whereas the cross-attention map consists of values between 0 and 1.
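The spectral clustering step can be sketched as follows, assuming a symmetrically normalized graph Laplacian in the spirit of Ng et al. [25]; the exact normalization, the clipping of negative similarities, and the number of clusters are assumptions. The resulting binary cluster masks then stand in for the cross-attention maps in Equation 5.

```python
# Sketch of spectral clustering on the similarity map h: build a normalized graph
# Laplacian, take the eigenvectors of its smallest eigenvalues, and cluster them
# with k-means. The normalization details are assumptions following Ng et al. [25].
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(h, n_clusters):
    """h: (P, P) similarity map. Returns a (P,) array of cluster ids."""
    W = np.clip(h, 0, None)                              # treat similarities as edge weights
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-8))
    L = np.eye(len(W)) - d_inv_sqrt @ W @ d_inv_sqrt     # normalized Laplacian
    _, eigvecs = np.linalg.eigh(L)                       # eigenvalues in ascending order
    U = eigvecs[:, :n_clusters]                          # eigenvectors of the smallest eigenvalues
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-8
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)

# Toy usage with a small random symmetric matrix standing in for h of Eq. (7).
toy = np.random.rand(400, 400)
labels = spectral_clusters((toy + toy.T) / 2, n_clusters=6)
```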
4 Experiment
First, we present the implementation details and then compare our results to previous methods in Sections 4.2 and 4.3. Next, we evaluate the open vocabulary aspect in Section 4.4. Finally, we justify the construction of MaskDiffusion through an ablation study in Section 4.5.
Fig. 5: Qualitative results. Images (a) to (c) depict scenes from PascalVOC [11],
images (d) and (e) represent scenes from Cityscapes [9], and image (f) shows a scene from Potsdam [1]. We compare MaskCLIP [44], GEM [38], and our proposed MaskDiffusion.
MaskCLIP [44] and GEM [38] provide fragmented and noisy segmentations, whereas
MaskDiffusion exhibits a more cohesive and accurate segmentation for each object.
Figure 5 presents a qualitative comparison of MaskCLIP [44], GEM [38], and our MaskDiffusion. MaskCLIP [44] and GEM [38] result in a noisy and fragmented segmentation, whereas MaskDiffusion achieves more consistent segments that better correspond with the shape of the objects and exhibit minimal noise.
To validate its performance, we conducted experiments comparing our Unsupervised MaskDiffusion to its closest counterpart, DiffSeg [35], under the same image-unit segmentation conditions as specified in their paper [35]. Following
other unsupervised segmentation methods [8, 13, 32, 35], we quantify the per-
formance of each method by employing the unsupervised mIoU, which matches
unlabeled clusters with ground-truth labels using a Hungarian matching algo-
rithm. We test on Cityscapes [9] with 19 and 27 classes and COCO-Stuff [5] with
27 classes. For a fair comparison, we input an empty string as a prompt.
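For reference, a minimal sketch of such a Hungarian-matching-based unsupervised mIoU is given below; it assumes, for simplicity, that the number of predicted clusters equals the number of ground-truth classes.

```python
# Sketch of the unsupervised mIoU: match predicted clusters to ground-truth
# classes with the Hungarian algorithm, then average the IoU of matched pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def unsupervised_miou(pred, gt, n_classes):
    """pred, gt: (H, W) integer maps with values in [0, n_classes)."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)   # rows: clusters, cols: classes
    np.add.at(conf, (pred.ravel(), gt.ravel()), 1)
    rows, cols = linear_sum_assignment(-conf)                 # maximize total overlap
    ious = []
    for r, c in zip(rows, cols):
        inter = conf[r, c]
        union = conf[r, :].sum() + conf[:, c].sum() - inter
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy usage with random 27-class maps.
miou = unsupervised_miou(np.random.randint(0, 27, (64, 64)),
                         np.random.randint(0, 27, (64, 64)), n_classes=27)
```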
The experimental results are shown in Table 3, which confirms that our
proposed Unsupervised MaskDiffusion outperforms conventional segmentation
methods using diffusion models and further underscores the effectiveness of in-
ternal features and their similarity map.
Fig. 6: Qualitative comparison of k-means clustering on the internal features and Unsupervised MaskDiffusion.
The results demonstrate that applying k-means on the internal features significantly outperforms traditional unsupervised segmentation, for example by 6.8% on the COCO-Stuff dataset. Moreover, the performance exceeds that of DINOv2 k-means across all datasets, showcasing the robust capabilities of the internal features without the need for specific training. Collectively, our findings establish the semantic nature of the internal features.
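A minimal sketch of this k-means baseline on the internal features is shown below; the feature shape and the number of clusters are illustrative assumptions.

```python
# Sketch of the k-means baseline: every pixel's internal feature is a data point,
# clustered into a fixed number of segments.
import torch
from sklearn.cluster import KMeans

def kmeans_segmentation(f, n_clusters):
    """f: (D, H, W) internal features. Returns an (H, W) map of cluster ids."""
    D, H, W = f.shape
    X = f.reshape(D, H * W).T.cpu().numpy()   # (HW, D) pixel-wise features
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    return labels.reshape(H, W)

seg = kmeans_segmentation(torch.randn(1024, 64, 64), n_clusters=27)  # e.g., 27 COCO-Stuff classes
```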
Figure 6 shows the clustering results of Unsupervised MaskDiffusion and of k-means on the internal features. It demonstrates that Unsupervised MaskDiffusion can more cleanly classify identical classes into the same segment and different classes into separate segments, compared to k-means on the internal features.
depicted in (c), indicating that its segmentation performance improves with more
general classes. Impressively, the segmentation of proper nouns is also achievable,
as evidenced in results (d) and (e). These findings serve as compelling evidence
that MaskDiffusion exhibits a robust capability for accurately segmenting open
vocabularies, including complex and specific categories, thus underscoring its
versatility and effectiveness in handling diverse and intricate segmentation tasks.
We further ablate the cross-attention maps extracted from different layers. Considering the structure of the UNet, the outer intermediate outputs are of larger size compared to the smaller inner intermediate outputs. Therefore, we compare various combinations at different positions, which are detailed in Table 5. Overall, combining the cross-attention maps with a resolution of 8 × 8 and 16 × 16 yields the best results.
Similarly, we examine the effect of combining different internal feature sizes
by applying k-means and evaluating the resulting segmentation with the unsu-
pervised mIoU. The quantitative results are summarized in Table 6, indicating
that combining feature maps of size 8, 16, and 32 clearly outperforms other size
combinations. Figure 8 presents the qualitative results of the ablation study.
Since earlier layers of the U-Net typically generate low-level features, the output
is less suitable for segmentation, whereas inner outputs provide more semantic
information.
These findings highlight the critical role that both the time step and the internal features play in the segmentation performance of MaskDiffusion.
Next, to explore the limitations of the internal features f, we created representative features using the images and ground-truth labels of the training dataset. In this experiment, f is first computed from the images in the training dataset, and the ground-truth masks are then used in place of the cross-attention maps to obtain the representative features f̄_c.
5 Limitation
While MaskDiffusion achieves remarkable results, especially compared to related
approaches, our proposed method still comes with certain limitations. First, the cross-attention maps are not always reliable when assigning internal features to a class.
This can be seen in Figure 5(b), where the segmentation is clean but the class
assignment is incorrect. Second, our setting assumes that potential candidates for
the classes appearing in the images are known beforehand. While we consider this
to be outside of the scope of this paper, it would be possible to use MaskDiffusion
in conjunction with models that are able to detect the presence of objects in
images, e.g. CLIP [27], to solve this limitation.
6 Conclusion
References
1. ISPRS: ISPRS 2D semantic labeling contest. http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
2. Amit, T., Shaharbany, T., Nachmani, E., Wolf, L.: Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021)
3. Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., Babenko, A.: Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126 (2021)
4. Bucher, M., Vu, T.H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 32 (2019)
5. Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1209–1218 (2018)
6. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the International Conference on Computer Vision (2021)
7. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
8. Cho, J.H., Mall, U., Bala, K., Hariharan, B.: Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)
10. Ding, Z., Wang, J., Tu, Z.: Open-vocabulary universal image segmentation with maskclip (2023)
11. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111, 98–136 (2015)
12. Feng, Q., Gadde, R., Liao, W., Ramon, E., Martinez, A.: Network-free, unsupervised semantic segmentation with synthetic images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23602–23610 (2023)
13. Hamilton, M., Zhang, Z., Hariharan, B., Snavely, N., Freeman, W.T.: Unsupervised semantic segmentation by distilling feature correspondences. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=SaKO6z6Hl0c
14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
15. Ji, X., Henriques, J.F., Vedaldi, A.: Invariant information clustering for unsupervised image classification and segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9865–9874 (2019)
16. Jiang, P., Gu, F., Wang, Y., Tu, C., Chen, B.: Difnet: Semantic segmentation by diffusion networks. Advances in Neural Information Processing Systems 31 (2018)
17. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
18. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8110–8119 (2020)
19. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
20. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
21. Li, D., Yang, J., Kreis, K., Torralba, A., Fidler, S.: Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8300–8311 (2021)
22. Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7061–7070 (2023)
23. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
24. Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7086–7096 (2022)
25. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems 14 (2001)
26. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
27. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
28. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
29. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
30. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
31. Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
32. Seong, H.S., Moon, W., Lee, S., Heo, J.P.: Leveraging hidden positives for unsupervised semantic segmentation (June 2023)
33. Shin, G., Xie, W., Albanie, S.: Reco: Retrieve and co-segment for zero-shot transfer. Advances in Neural Information Processing Systems 35, 33754–33767 (2022)
34. Tan, W., Chen, S., Yan, B.: Diffss: Diffusion model for few-shot semantic segmentation. arXiv preprint arXiv:2307.00773 (2023)
35. Tian, J., Aggarwal, L., Colaco, A., Kira, Z., Gonzalez-Franco, M.: Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion. arXiv preprint arXiv:2308.12469 (2023)
36. Tritrong, N., Rewatbowornwong, P., Suwajanakorn, S.: Repurposing gans for one-shot semantic part segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4475–4485 (2021)
37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
38. Bousselham, W., Petersen, F., Ferrari, V., Kuehne, H.: Grounding everything: Emerging localization properties in vision-language transformers. arXiv preprint arXiv:2312.00878 (2023)
39. Wolleb, J., Sandkühler, R., Bieder, F., Valmaggia, P., Cattin, P.C.: Diffusion models for implicit image segmentation ensembles. In: International Conference on Medical Imaging with Deep Learning. pp. 1336–1348. PMLR (2022)
40. Wu, W., Zhao, Y., Shou, M.Z., Zhou, H., Shen, C.: Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. arXiv preprint arXiv:2303.11681 (2023)
41. Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2955–2966 (2023)
42. Zhang, J., Herrmann, C., Hur, J., Cabrera, L.P., Jampani, V., Sun, D., Yang, M.H.: A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. arXiv preprint arXiv:2305.15347 (2023)
43. Zhao, Z., Wang, Y., Liu, K., Yang, H., Sun, Q., Qiao, H.: Semantic segmentation by improved generative adversarial networks. arXiv preprint arXiv:2104.09917 (2021)
44. Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from clip. In: European Conference on Computer Vision. pp. 696–712. Springer (2022)
45. Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Gao, J., Lee, Y.J.: Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718 (2023)