
MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

Yasufumi Kawano and Yoshimitsu Aoki
Keio University, Japan
ykawano@aoki-medialab.jp, aoki@elec.keio.ac.jp

arXiv:2403.11194v1 [cs.CV] 17 Mar 2024

Abstract. Semantic segmentation is essential in computer vision for
various applications, yet traditional approaches face significant challenges,
including the high cost of annotation and extensive training for super-
vised learning. Additionally, due to the limited predefined categories in
supervised learning, models typically struggle with infrequent classes and
are unable to predict novel classes. To address these limitations, we pro-
pose MaskDiffusion, an innovative approach that leverages pretrained
frozen Stable Diffusion to achieve open-vocabulary semantic segmenta-
tion without the need for additional training or annotation, leading to
improved performance compared to similar methods. We also demon-
strate the superior performance of MaskDiffusion in handling open vo-
cabularies, including fine-grained and proper noun-based categories, thus
expanding the scope of segmentation applications. Overall, our MaskD-
iffusion shows significant qualitative and quantitative improvements in
contrast to other comparable unsupervised segmentation methods, i.e. on
the Potsdam dataset (+10.5 mIoU compared to GEM) and COCO-Stuff
(+14.8 mIoU compared to DiffSeg). All code and data will be released
at https://github.com/Valkyrja3607/MaskDiffusion.

1 Introduction

Semantic segmentation, a fundamental task in computer vision, assigns class
labels to every pixel in an image. Its applications span across diverse domains,
including automated driving and medical image analysis. Despite its significance,
current semantic segmentation methods still face several critical challenges. First
of all, these methods tend to be costly, requiring pixel-level annotation as well
as extensive training. Secondly, since supervised learning relies on a pre-defined
set of categories, detecting extremely rare or even completely new classes during
prediction becomes virtually impossible.
In this paper, we address these limitations by combining two related tasks,
namely unsupervised and open-vocabulary semantic segmentation. Unsupervised
semantic segmentation [8,13,32] avoids costly annotation by leveraging represen-
tations obtained through a model [6, 26] which has been trained on a different
task. On the other hand, open-vocabulary semantic segmentation [10, 20, 22, 41,
44] allows recognizing a wide array of categories through natural language and is
not bound to a pre-defined set of categories. To integrate these two approaches,

[Figure 1 panels: input image; (a) k-means of Diffusion U-Net internal features; (b) MaskDiffusion (Ours); (c) MaskCLIP]

Fig. 1: Comparison of MaskDiffusion with the previous method MaskCLIP [44] on a Cityscapes [9] image. k-means clustering on the internal features of the Diffusion U-Net (a) shows that each determined cluster roughly partitions the image according to some classes, indicating that the semantic information is well preserved. MaskDiffusion (b) yields well-partitioned segments consistent with the shape of the object and exhibits minimal noise. In comparison, MaskCLIP [44] (c) results in smaller and noisy segments.

we leverage the internal feature space as well as the attention maps of a pre-
trained Stable Diffusion model.
Diffusion models [14, 28, 30] trained on large-scale datasets have revolution-
ized the field of image generation, which can be largely attributed to conditioning
with text embeddings derived from large-scale pre-trained visual-language mod-
els such as CLIP [27]. We hypothesize that these diffusion models have mastered
a wide variety of open vocabulary concepts that could be used for dense predic-
tion tasks, particularly for semantic segmentation. In previous works [22,41], the
internal features of a frozen diffusion model have already been used for semantic
segmentation after some additional training. Based on these observations, we
take a closer look at the internal features of diffusion models for semantic seg-
mentation and introduce MaskDiffusion, which achieves effective segmentation
in the wild without additional training.
The strengths of our MaskDiffusion are manifold. First, it eliminates the
need for pixel-by-pixel annotations typically required by popular semantic seg-
mentation techniques. Second, it has the ability to segment any class of objects,
distinguishing it from traditional diffusion-based methods [2, 3, 41].
As shown in Figure 1, applying k-means clustering on the internal features of the U-Net in Stable Diffusion already provides a rough yet consistent segmentation, whereas MaskCLIP [44] fails to uniformly segment the image. We are able to further improve these segmentation results by leveraging the internal features in our proposed MaskDiffusion.
In addition, based on the observation that internal features are useful for segmentation, we propose an unsupervised segmentation method, called Unsupervised MaskDiffusion. Unlike MaskDiffusion, Unsupervised MaskDiffusion operates solely on image inputs without requiring class candidate prompts, hence the name "Unsupervised". It segments objects of the same class by spectral clustering [25], computing a Laplacian matrix from the similarity between the internal features of individual pixels.
Our contributions are the following:

1. We analyze the internal features of diffusion models and show that they are useful for semantic segmentation. Simple k-means clustering of these internal features achieves an unsupervised mIoU comparable to that of conventional unsupervised segmentation methods.
2. We introduce MaskDiffusion and Unsupervised MaskDiffusion achieving com-
pelling segmentation results for all categories in the wild without any addi-
tional training.
3. Our MaskDiffusion outperforms GEM [38] by 10.5 mIoU on the Potsdam dataset [1], demonstrating the superior segmentation performance of our proposed approach. Moreover, our Unsupervised MaskDiffusion surpasses DiffSeg [35] by 14.8 mIoU on the COCO-Stuff dataset [5], as measured by the unsupervised mIoU.

2 Related Work

2.1 Diffusion Model

Diffusion models, particularly Denoising Diffusion Probabilistic Models (DDPM) [14], excel at generating high-quality images by iteratively adding Gaussian noise to an image and then learning to estimate and remove it. Stable Diffusion builds on the Latent Diffusion Model (LDM) [28], which combines a Variational Autoencoder (VAE) [19], a UNet [29], and CLIP [27], a powerful visual language model. CLIP acts as the text encoder: its text embeddings condition the latent space of the UNet [29] through cross-attention, effectively enabling the representation of textual content in synthesized images. Notably, the UNet's training on the LAION-5B dataset [31], with its 5 billion image-text pairs, highlights the importance of large-scale data for robust text-to-image generation.

2.2 Semantic Segmentation

Semantic segmentation involves pixel-wise class labeling, commonly using convolutional neural networks [7, 23] or vision transformers [37] for end-to-end training.
These methods, while effective, depend on extensive labeled data and significant
Table 1: Relationship with related works.

Method                            | Backbone Pretraining     | Additional Training | Language Dependency | Class Identifiability | Mapping Consistency
Open Vocabulary Segmentation
ODISE [41]                        | SD [28] internal feature | ✓                   | ✓                   | ✓                     | ✓
MaskCLIP [44]                     | CLIP [27]                | -                   | ✓                   | ✓                     | ✓
MaskDiffusion (Ours)              | SD [28] internal feature | -                   | ✓                   | ✓                     | ✓
Unsupervised Segmentation
STEGO [13]                        | DINO [6]                 | ✓                   | -                   | -                     | ✓
DiffSeg [35]                      | SD [28] self-attention   | -                   | -                   | -                     | -
Unsupervised MaskDiffusion (Ours) | SD [28] internal feature | -                   | -                   | -                     | -

computational resources, and are limited to predefined categories. Thus, unsupervised and domain-flexible approaches have recently gained importance.
Unsupervised semantic segmentation [8,13,15,32] attempts to solve semantic
segmentation without using any kind of supervision. STEGO [13] and HP [32]
optimize the head of a segmentation model using image features obtained from
a backbone pre-trained by DINO [6], an unsupervised method for many tasks.
However, unsupervised semantic segmentation clusters images by class but can-
not identify each cluster’s class. In contrast, our MaskDiffusion distinguishes
classes without extra annotation or training.
Open vocabulary semantic segmentation, crucial for segmenting objects across
domains without being limited to predefined categories, has seen notable ad-
vancements with the introduction of key methodologies [4, 20, 24, 33, 38, 44, 45].
MaskCLIP [44], GEM [38] and CLIPSeg [24], based on the CLIP [27] visual lan-
guage model, have advanced open vocabulary semantic segmentation. In particu-
lar, MaskCLIP [44] and GEM [38] segment various categories without additional
training. These two approaches, reducing annotation and training costs and over-
coming category limitations, have inspired a new direction in our research, mo-
tivating us to explore similar problem formulations. Furthermore, ODISE [41]
has extended the capabilities of pre-trained diffusion models, incorporating addi-
tional training of a segmentation network to achieve open-vocabulary panoptic
segmentation. The application of image generation models to segmentation is
discussed in detail in Sec 2.3.
Table 1 shows the pre-trained model used (Backbone Pretraining), whether
additional training is used (Additional Training), whether text input (prompt)
is required for segmentation (Language Dependency), whether the same class
can be assigned to the same segment index across images (Class Identifiability),
and whether there is consistency in the index mapped across images (Mapping
Consistency). In particular, regarding Class Identifiability and Mapping Consistency, unsupervised segmentation methods do not explicitly know the mapping between output indices and classes. STEGO [13], however, has a consistent mapping across images. In contrast, DiffSeg [35] and our Unsupervised MaskDiffusion output different indices for segments of the same class across images.
[Figure 2: architecture diagram. An input image is encoded by the VAE, the class names ("cat, mirror, ..., background") are embedded by CLIP, and the text-to-image diffusion UNet produces cross-attention maps and a 1024 × H × W internal feature, which are post-processed into the segmentation.]

Fig. 2: High-level overview of our MaskDiffusion architecture. MaskDiffusion uses a frozen pre-trained diffusion model. The UNet is given latent images compressed by a VAE as well as text prompts embedded by CLIP. The prompts are the names of all the potential classes to be segmented. The output of each layer of the U-Net is extracted as a concatenated internal feature f and a cross-attention map, which are subsequently post-processed into a segmentation image.

2.3 Generative Models for Segmentation

The use of image generative models, including Generative Adversarial Networks (GAN) [12, 17, 18, 21, 36, 43] and diffusion models [2, 3, 16, 34, 35, 39-42], has been
a focal point in various prior studies concerning semantic segmentation. DDPM-
Seg [3] revealed that a generative model’s internal representation correlates with
visual semantics, aiding semantic segmentation but limited to a predefined set
of labels. Addressing this, ODISE [41] enables panoptic segmentation with open
vocabularies. DiffSeg [35], similarly to our method, provides unsupervised seg-
mentation without additional training by using KL divergence in UNet’s self-
attention maps for segmentation, but it cannot consistently map the same class
to the same index across images.
Several methodologies [3, 16, 34, 39, 41, 42] have leveraged diffusion models as
a foundational backbone for segmentation tasks, often incorporating additional
models and training to facilitate the segmentation process.
In alignment with the existing research landscape, we propose an open-
vocabulary semantic segmentation method without additional training.

3 Method

Our research attempts to explore the applicability of diffusion models to semantic segmentation. We begin with a brief introduction to the architecture of
diffusion models in Sec 3.1, followed by a detailed description of the proposed
MaskDiffusion in Sec 3.2.
[Figure 3: diagram of the post-processing. The 1024 × H × W cross-attention maps and internal feature are combined by a weighted mean into representative features (1024-dim × number of classes, e.g. cat, mirror, background), which are compared to the internal feature via cosine similarity to produce the semantic segmentation.]

Fig. 3: Overview of the post-processing step. First, a representative f is computed for each category through a weighted average of f based on the values of the cross-attention map. In the next step, we determine the semantic segmentation result by evaluating the cosine similarity between f and the representative f for each category and then assign each f to the category that has the closest similarity.

3.1 Diffusion Model Architecture


Stable Diffusion [28] follows the two processes in diffusion models, namely the
diffusion process and the inverse diffusion process. The former involves the in-
troduction of Gaussian noise to a clean image, while the latter employs a single
UNet architecture to eliminate the noise and restore the original image. The
diffusion and inverse diffusion processes each consist of 1000 steps, and a single
UNet is responsible for all time steps. During training, we estimate the noise as

L_{LDM} := \mathbb{E}_{\varepsilon(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t} \left[ \| \epsilon - \epsilon_{\theta}(z_t, t, \tau_{\theta}(y)) \|^2_2 \right],   (1)

where x is the input image, z = ε(x) is the latent image, ϵ is noise sampled from a normal distribution N(0, 1), t is the time step, y is the text prompt, τ_θ is a pre-trained text encoder, and ϵ_θ(z_t, t, τ_θ(y)) is the predicted noise, which is the output of the UNet.
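As a concrete illustration of Equation 1, the following PyTorch sketch computes the noise-prediction loss for one batch. The `encoder`, `unet`, and `text_encoder` callables and the `alphas_cumprod` noise schedule are placeholders we assume here, not the actual Stable Diffusion implementation.

```python
import torch
import torch.nn.functional as F

def ldm_loss(x, y_tokens, encoder, unet, text_encoder, alphas_cumprod):
    """Noise-prediction objective of Eq. 1 (sketch with placeholder modules)."""
    z0 = encoder(x)                                    # latent image epsilon(x)
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)                       # epsilon ~ N(0, 1)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)        # cumulative noise schedule
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise  # diffused latent z_t
    cond = text_encoder(y_tokens)                      # tau_theta(y), text conditioning
    pred = unet(z_t, t, cond)                          # epsilon_theta(z_t, t, tau_theta(y))
    return F.mse_loss(pred, noise)                     # || epsilon - epsilon_theta ||_2^2
```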
A significant aspect of Stable Diffusion [28] lies in the utilization of a trained
CLIP [27] as the text encoder, facilitating the conversion of textual inputs into
corresponding vectors. Embedding these vectors into the UNet architecture en-
ables the integration of text-based conditioning during the denoising process,
achieved through cross-attention mechanisms. Attention consists of three vec-
tors: query (Q), key (K), and value (V ), and is defined as

Q = W^{(i)}_Q \cdot \phi_i(z_t), \quad K = W^{(i)}_K \cdot \tau_{\theta}(y), \quad V = W^{(i)}_V \cdot \tau_{\theta}(y),

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V,   (2)

where d is the scaling factor, τ_θ is the text encoder, y is the text prompt, N is the number of tokens in the input sequence, ϕ_i(z_t) ∈ R^{N×d_i} is a (flattened) intermediate representation of the UNet implementing ϵ_θ, and W^{(i)}_Q ∈ R^{d×d_i}, W^{(i)}_K ∈ R^{d×d_τ}, W^{(i)}_V ∈ R^{d×d_τ} are projection matrices. The query is a vector created from the input image, while the key and value are created from the text vector.
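For clarity, a minimal PyTorch sketch of this single-head cross-attention (Equation 2) is given below; the projection weights are plain tensors rather than the UNet's actual attention modules.

```python
import torch

def cross_attention(phi_z, tau_y, W_Q, W_K, W_V):
    """Single-head cross-attention of Eq. 2 (sketch).

    phi_z: (N, d_i) flattened UNet features, tau_y: (M, d_tau) text embeddings,
    W_Q: (d, d_i), W_K: (d, d_tau), W_V: (d, d_tau) projection matrices.
    """
    Q = phi_z @ W_Q.T                  # (N, d) queries from the image features
    K = tau_y @ W_K.T                  # (M, d) keys from the text embedding
    V = tau_y @ W_V.T                  # (M, d) values from the text embedding
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5        # (N, M) scaled dot-product scores QK^T / sqrt(d)
    return torch.softmax(scores, dim=-1) @ V   # (N, d) attended output
```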

3.2 MaskDiffusion
Figure 2 shows our proposed method which we call MaskDiffusion, a novel ap-
proach that leverages a pre-trained diffusion model as a semantic segmentation
model, eliminating the need for additional training. MaskDiffusion capitalizes
on the observation that the intrinsic features derived from the internal layer
of the UNet in Stable Diffusion [28] inherently contain semantic information,
as demonstrated in the experimental results outlined in Section 4.3. For each
pixel of the input image we obtain an internal feature, denoted as f ∈ R^1024, by upsampling and combining the output of the internal layers as follows,

\mathbf{f}_i = \text{UNetLayer}_i(x_i, \tau_{\theta}(s)), \qquad \mathbf{f} = \text{Concatenate}(\mathbf{f}_1, \ldots, \mathbf{f}_n),   (3)

where x_i denotes the input to the i-th internal UNet layer, s represents the prompt and n is the number of internal UNet layers. Note that x_0 is the input image and x_1 and beyond are the outputs of the previous layers.
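One way to realize Equation 3 is to register forward hooks on the chosen UNet blocks during a single denoising pass, upsample each captured output to a common resolution, and concatenate along the channel dimension. The sketch below assumes a generic `unet(z_t, t, text_emb)` callable and a list of hookable `layers`; it is not tied to the exact Stable Diffusion implementation.

```python
import torch
import torch.nn.functional as F

def extract_internal_features(unet, layers, z_t, t, text_emb, out_hw):
    """Collect the concatenated per-pixel internal feature f of Eq. 3 (sketch)."""
    captured = []
    hooks = [m.register_forward_hook(lambda _m, _inp, out: captured.append(out))
             for m in layers]
    with torch.no_grad():
        unet(z_t, t, text_emb)          # one forward pass through the frozen UNet
    for h in hooks:
        h.remove()
    # Upsample every captured map (assumed to be a 4D tensor) to a common (H, W)
    # and stack along the channel dimension, giving roughly 1024 channels in total.
    upsampled = [F.interpolate(f.float(), size=out_hw, mode="bilinear",
                               align_corners=False) for f in captured]
    return torch.cat(upsampled, dim=1)  # (B, C_total, H, W)
```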
Due to its high dimensionality, determining the class of each pixel solely
based on f proves to be a challenging task. Therefore, we propose to incorporate
the information contained in the cross-attention maps in the UNet architecture
into f in order to assign a class to each pixel.

\text{Cross-attention map}(Q, K) = \frac{QK^T}{\sqrt{d}}   (4)

We define the cross-attention map as shown in Equation 4, based on the definition of the cross-attention in Equation 2. Note that we do not use the
softmax function in Equation 4 as it introduces undesirable normalization across
classes, which disrupts the intrinsic relationship between pixels and consequently
leads to poorer performance.
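A sketch of Equation 4, producing one coarse localization map per class-name token from the image-derived queries and the text-derived keys, is shown below; the variable names and tensor layout are our assumptions.

```python
import torch

def class_attention_maps(Q, K_class, h, w):
    """Unnormalized cross-attention map of Eq. 4 (sketch).

    Q: (H*W, d) image queries, K_class: (C, d) keys of the class-name tokens.
    No softmax is applied across classes, keeping per-pixel scores unnormalized.
    """
    d = Q.shape[-1]
    scores = Q @ K_class.T / d ** 0.5   # (H*W, C) raw scores QK^T / sqrt(d)
    return scores.T.reshape(-1, h, w)   # (C, H, W), one coarse map per class
```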
Cross-attention establishes a relationship between the textual input prompts,
which include the category names for segmentation, and the pixel-wise features
of the image. In essence, the cross-attention map can be viewed as a preliminary
semantic segmentation map, highlighting areas of potential object localization.
However, using the cross-attention map as a standalone solution does not lead
to accurate results as shown through our experiments, which are summarized
in Table 2. Instead, we leverage the cross-attention map in combination with
the extracted high-dimensional internal features, f , to assign each pixel a corre-
sponding class.
Next, we explain how to incorporate the information of the cross-attention
map into the internal features f ∈ R^{1024×H×W}, as illustrated in Figure 3. First,
[Figure 4: architecture diagram. The image is encoded by the VAE and passed through the text-to-image diffusion UNet to obtain the 1024 × H × W internal feature, which is reshaped to 1024 × (H × W), turned into an (H × W) × (H × W) cosine similarity map, spectrally clustered, and post-processed into the clustering result.]

Fig. 4: Overview of the Unsupervised MaskDiffusion architecture. We employ spectral clustering [25] to include the spatial relationships of the internal features in the segmentation process.

we compute a representative internal feature f̄_c ∈ R^1024 for each category c through a weighted average based on the values a_{cnm} ∈ R of the cross-attention map, as follows:

\bar{\mathbf{f}}_{c} = \frac{1}{A_c} \sum_{n,m} a_{cnm} \cdot \mathbf{f}_{nm}, \qquad A_c = \sum_{n,m} a_{cnm}   (5)

In the subsequent step, the semantic map s ∈ R^{H×W} is obtained by calculating, for each pixel position (n, m), the cosine similarity between the internal feature vector and the representative feature f̄_c of each class, and subsequently assigning the class with the highest similarity. Mathematically, this process is described as

s_{nm} = \underset{c}{\text{argmax}} \; \frac{\mathbf{f}_{nm}^T \cdot \bar{\mathbf{f}}_c}{\| \mathbf{f}_{nm} \| \cdot \| \bar{\mathbf{f}}_c \|},   (6)

and can be interpreted as assigning the category, represented by f̄c , which ex-
hibits the closest resemblance with the internal feature fnm of the corresponding
pixel.
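The complete post-processing of Equations 5 and 6 can be written compactly. The sketch below takes the concatenated internal features and the per-class cross-attention maps as tensors and returns a map of class indices; the tensor layout is our assumption.

```python
import torch

def post_process(feat, attn, eps=1e-8):
    """MaskDiffusion post-processing, Eqs. 5 and 6 (sketch).

    feat: (1024, H, W) internal features f; attn: (C, H, W) cross-attention maps.
    Returns an (H, W) tensor of predicted class indices.
    """
    C, H, W = attn.shape
    f = feat.reshape(feat.shape[0], -1)                     # (1024, H*W)
    a = attn.reshape(C, -1)                                 # (C, H*W) weights a_{cnm}
    # Eq. 5: attention-weighted mean feature per class (representative f_bar_c).
    rep = (a @ f.T) / (a.sum(dim=1, keepdim=True) + eps)    # (C, 1024)
    # Eq. 6: cosine similarity between every pixel feature and every representative.
    f_n = f / (f.norm(dim=0, keepdim=True) + eps)
    rep_n = rep / (rep.norm(dim=1, keepdim=True) + eps)
    sim = rep_n @ f_n                                       # (C, H*W)
    return sim.argmax(dim=0).reshape(H, W)                  # class with highest similarity
```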

3.3 Unsupervised MaskDiffusion


In our experimental results with MaskDiffusion, as detailed in Section 4.2, we
found that the internal features seem to be quite powerful, while the cross-
attention map demonstrates some limitations. As a consequence, we also ex-
plore an alternative approach that only uses internal features in the context of
unsupervised segmentation. As such, this method does not require any prompts
and only relies on images as input, which corresponds to the same setting as
DiffSeg [35].
An overview of Unsupervised MaskDiffusion is shown in Figure 4. In order to cluster
the obtained internal features in the text-free setting, we explore two different
clustering methods, namely k-means and spectral clustering [25]. In particular,
we employ spectral clustering [25] to include the spatial relationships of the
internal features in the segmentation process. Based on the observation that
internal features representing the same class should be close with respect to the
distance in pixel space, we compute the similarity map h ∈ R^{HW×HW} between the reshaped internal features f̂ ∈ R^{HW×1024} as

\mathbf{h}_{ij} = \frac{\hat{\mathbf{f}}_{i}^T \cdot \hat{\mathbf{f}}_{j}}{\| \hat{\mathbf{f}}_{i} \| \cdot \| \hat{\mathbf{f}}_{j} \|},   (7)

where i and j denote the pixel location. A large value of hij in this map indicates
that the pixels at i and j very likely belong to the same class. Note that we set
H and W to 100 and accordingly resize f to 100 × 100 to reduce computational
complexity.
Next, we derive a Laplacian matrix from the similarity map and cluster pixels based on the eigenvectors corresponding to its smallest eigenvalues. The clustering results are then post-processed using f, where the cluster assignments replace the cross-attention map in Equation 5 to produce the final segmentation image. Note that the weights produced by spectral clustering [25] are either 0 or 1, whereas the cross-attention map consists of values between 0 and 1.
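A text-free sketch of this procedure using scikit-learn's spectral clustering is shown below. Clamping the cosine similarities to be non-negative so they form a valid affinity matrix is our assumption, and the resulting hard cluster masks would then replace the cross-attention map in Equation 5.

```python
import torch
from sklearn.cluster import SpectralClustering

def unsupervised_maskdiffusion_clusters(feat, n_clusters, eps=1e-8):
    """Cluster internal features without prompts (Eq. 7 + spectral clustering, sketch).

    feat: (1024, H, W) internal features on a reduced grid (e.g. 100 x 100).
    Returns an (H, W) map of cluster indices; the index-to-class mapping is undefined.
    """
    c, H, W = feat.shape
    f = feat.reshape(c, -1).T                          # (H*W, 1024) reshaped features
    f = f / (f.norm(dim=1, keepdim=True) + eps)        # unit-norm rows
    sim = (f @ f.T).clamp(min=0).cpu().numpy()         # Eq. 7 cosine similarity map,
                                                       # clamped to a valid affinity (assumption)
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                assign_labels="kmeans").fit_predict(sim)
    return torch.from_numpy(labels).reshape(H, W)
```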

4 Experiment
First, we present the implementation details and then compare our results to
previous methods in Section 4.2 and 4.3. Next, we evaluate the open vocabu-
lary aspect in Section 4.4. Finally, we justify the construction of MaskDiffusion
through an ablation study in Section 4.5.

4.1 Implementation Details


For our implementation of MaskDiffusion, we employed a frozen Stable Diffusion
model [28] that was pre-trained on a subset of the LAION [31] dataset, utiliz-
ing CLIP [27]’s ViT-L/14 architecture to condition the diffusion process on text
input. We extracted internal features and cross-attention maps from Stable Dif-
fusion’s three UNet blocks and combined them by resizing and concatenating. In
addition, images exceeding 512 × 512 are segmented into 512 × 512 patches prior
to model input. Furthermore, we set the time step for the diffusion process to
t = 1, representing the last denoising step, as this ensures an efficient processing
time of less than 2 seconds per image. Our model runs within 15 GB of GPU memory.
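The pre-processing of large inputs can be sketched as follows; the zero-padding and how the per-patch predictions are stitched back together are our assumptions, since the paper only states that larger images are split into 512 × 512 patches.

```python
import torch
import torch.nn.functional as F

def split_into_patches(image, patch=512):
    """Split a (C, H, W) image into non-overlapping patch x patch tiles (sketch)."""
    c, h, w = image.shape
    pad_h = (patch - h % patch) % patch          # zero-pad so H and W divide evenly
    pad_w = (patch - w % patch) % patch
    padded = F.pad(image, (0, pad_w, 0, pad_h))
    tiles = padded.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, nH, nW, p, p)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)
```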
Table 2: Comparison of MaskCLIP [44], GEM [38], and our MaskDiffusion. We evaluate on Potsdam [1], Cityscapes [9], PascalVOC [11] and COCO-Stuff [5] and report the mIoU.

Method               | Potsdam [1] (6 classes) | Cityscapes [9] (19 classes) | PascalVOC [11] (20 classes) | COCO-Stuff [5] (171 classes)
MaskCLIP [44]        | 10.4                    | 15.5                        | 28.6                        | 1.6
GEM [38]             | 10.7                    | 16.3                        | 26.5                        | 0.8
MaskDiffusion (Ours) | 21.2                    | 17.1                        | 29.9                        | 5.5

[Figure 5: qualitative comparison grid with rows MaskCLIP, GEM, MaskDiffusion (Ours), and GT over six example images (a)-(f).]

Fig. 5: Qualitative results. Images (a) to (c) depict scenes from PascalVOC [11], images (d) and (e) represent scenes from Cityscapes [9], and image (f) shows a scene from Potsdam [1]. We compare MaskCLIP [44], GEM [38], and our proposed MaskDiffusion. MaskCLIP [44] and GEM [38] provide fragmented and noisy segmentations, whereas MaskDiffusion exhibits a more cohesive and accurate segmentation for each object.

4.2 Main Results


To validate the performance of MaskDiffusion, we conducted comprehensive comparative experiments with its closest counterparts, MaskCLIP [44] and GEM [38]. For the evaluation, we used the mean Intersection over Union (mIoU) as the primary metric. We perform our experiments on the Potsdam [1] dataset comprising 6 classes, Cityscapes [9] with 19 classes, and PascalVOC [11] with 20 classes,
as well as COCO-Stuff [5] consisting of 171 classes. Note that we input the names
of all the classes included in the dataset as prompts.
The quantitative results are presented in detail in Table 2. Our method outperforms previous research across all datasets, notably achieving a +10.5 mIoU improvement on the Potsdam [1] land-cover dataset, which demonstrates the robustness of the internal features for class estimation compared to CLIP-based approaches.
The qualitative results are shown in Figure 5. Images (a) to (c) from Pas-
calVOC [11], (d) and (e) from Cityscapes [9], and (f) from Potsdam [1] compare
Table 3: Comparison of Unsupervised MaskDiffusion with state-of-the-art unsupervised segmentation methods. We use the Cityscapes [9] and COCO-Stuff [5] datasets for comparison and report the mIoU. The clear boost in performance by including the internal features underlines their semantic richness and effectiveness.

Method            | Backbone pretraining     | Training free | Cityscapes [9] 19 classes | Cityscapes [9] 27 classes | COCO-Stuff [5] 27 classes
IIC [15]          | -                        | -             | -    | 6.4  | 6.7
PiCIE [8]         | -                        | -             | -    | 12.3 | 13.8
STEGO [13]        | DINO [6]                 | -             | -    | 21.0 | 28.2
HP [32]           | DINO [6]                 | -             | -    | 18.4 | 24.6
k-means           | DINOv2 [26]              | ✓             | 20.5 | 19.3 | 33.8
k-means           | SD [28] internal feature | ✓             | 23.3 | 20.1 | 35.0
DiffSeg [35]      | SD [28] self-attention   | ✓             | -    | 21.2 | 43.6
MaskDiffusion (U) | SD [28] internal feature | ✓             | 28.5 | 25.3 | 58.4

MaskCLIP [44], GEM [38], and our MaskDiffusion. MaskCLIP [44] and GEM [38]
result in a noisy and fragmented segmentation, whereas MaskDiffusion achieves
more consistent segments that better correspond with the shape of the object
and exhibit minimal noise.
To validate its performance, we conducted experiments comparing our Unsupervised MaskDiffusion to its closest counterpart, DiffSeg [35], under the same per-image segmentation conditions as specified in their paper [35]. Following
other unsupervised segmentation methods [8, 13, 32, 35], we quantify the per-
formance of each method by employing the unsupervised mIoU, which matches
unlabeled clusters with ground-truth labels using a Hungarian matching algo-
rithm. We test on Cityscapes [9] with 19 and 27 classes and COCO-Stuff [5] with
27 classes. For a fair comparison, we input an empty string as a prompt.
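As a sketch of this evaluation protocol (not the exact code used here), the Hungarian matching between predicted cluster indices and ground-truth classes can be implemented with scipy's `linear_sum_assignment`, assuming the number of clusters equals the number of classes:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unsupervised_miou(pred, gt, n_classes, ignore_index=255):
    """Hungarian-matched mIoU between cluster indices and ground-truth labels (sketch)."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    # Confusion matrix: rows are predicted clusters, columns are ground-truth classes.
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (pred, gt), 1)
    # Match each cluster to the class that maximizes the total overlap.
    rows, cols = linear_sum_assignment(conf, maximize=True)
    conf = conf[rows][:, cols]                 # reorder so matched pairs lie on the diagonal
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - inter
    return float(np.mean(inter / np.maximum(union, 1)))
```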
The experimental results are shown in Table 3, which confirms that our
proposed Unsupervised MaskDiffusion outperforms conventional segmentation
methods using diffusion models and further underscores the effectiveness of in-
ternal features and their similarity map.

4.3 Internal Feature

In this section, we experimentally investigate the role of the internal features, which are represented by 1024-dimensional vectors for each pixel. To prove that
these internal features hold sufficient semantic information, we conduct k-means
segmentation on a pixel-wise basis. To that end, we choose a specific number of
k-means clusters for each dataset.
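This pixel-wise k-means baseline can be sketched directly with scikit-learn; the feature tensor layout is our assumption.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_segmentation(feat, n_clusters):
    """Pixel-wise k-means on the internal features (sketch).

    feat: (1024, H, W). Returns an (H, W) map of cluster indices, which is then
    evaluated with the Hungarian-matched unsupervised mIoU.
    """
    c, H, W = feat.shape
    pixels = feat.reshape(c, -1).T.cpu().numpy()         # (H*W, 1024) feature vectors
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pixels)
    return torch.from_numpy(labels).reshape(H, W)
```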
For comparison, we use common unsupervised segmentation methods [8, 13,
15, 32] as well as k-means on DINOv2 [26]. The experimental setup is the same
as in the previous section.
The results of our experiments are presented in Table 3, highlighting the
strong performance of the internal features across diverse datasets. Particularly,
[Figure 6: qualitative comparison with columns input, k-means, and Unsupervised MaskDiffusion for two sets of examples.]

Fig. 6: Results of Unsupervised MaskDiffusion and of k-means on internal features for COCO-Stuff [5] and Cityscapes [9]. Unsupervised MaskDiffusion produces qualitatively cleaner results.

the results demonstrate that applying k-means on the internal features significantly outperforms traditional unsupervised segmentation, for example by 6.8 mIoU on the COCO-Stuff dataset (35.0 vs. 28.2 for STEGO [13]). Moreover, the performance exceeds that of DINOv2
k-means across all datasets, showcasing the robust capabilities of the internal fea-
tures without the need for specific training. Collectively, our findings establish
the semantic nature of the internal features.
Figure 6 shows the clustering results of Unsupervised MaskDiffusion and of k-means on the internal features. It demonstrates that, compared to k-means, Unsupervised MaskDiffusion more cleanly groups identical classes into the same segment and separates different classes into distinct segments.

4.4 Open Vocabulary Segmentation on Web-Crawled Images

In this section, we thoroughly assess the performance of MaskDiffusion using open vocabulary segmentation experiments on web-crawled images, where we
test the model’s ability to accurately segment various unseen classes, including
specific and detailed categories such as ’joker’ and ’porsche’.
The qualitative outcomes of the experiments are visually depicted in Figure 7.
In this figure, (a) represents a general image, while (b) and (c) showcase images
created by Stable Diffusion. Furthermore, (d) and (e) exhibit images containing
proper nouns, with the respective class entered as a prompt displayed below each
image. Our results highlight the successful segmentation of challenging concepts
such as ’mirror’ in (a) and rare segmentation tasks such as ’astronaut’ in (b).
Additionally, the model demonstrates the capability to identify general classes, as
[Figure 7 panels, each showing an input image and the MaskDiffusion output, with the following prompts: (a) cat, mirror, background; (b) horse, astronaut, background; (c) broccoli, tomato, potato, background; (d) lamborghini, porsche, background; (e) joker, batman, background.]

Fig. 7: Open-vocabulary segmentation results. In (a) we test on a general image, (b) and (c) show images generated by Stable Diffusion, and (d) and (e) are images featuring specific proper nouns. The corresponding prompt for each class is displayed below the respective image. The successful segmentation of challenging concepts, rare classes, and proper nouns highlights the effectiveness of MaskDiffusion in handling diverse segmentation tasks.

depicted in (c), indicating that its segmentation performance improves with more
general classes. Impressively, the segmentation of proper nouns is also achievable,
as evidenced in results (d) and (e). These findings serve as compelling evidence
that MaskDiffusion exhibits a robust capability for accurately segmenting open
vocabularies, including complex and specific categories, thus underscoring its
versatility and effectiveness in handling diverse and intricate segmentation tasks.

4.5 Ablation Study

In this section, we perform ablation studies on MaskDiffusion, examining how various Stable Diffusion [28] time steps and UNet outputs for the internal features affect performance, using the 19-class Cityscapes [9] dataset.
Table 4 presents the results of the ablation experiments comparing the effect
of using MaskDiffusion with different time steps on the mIoU. The results reveal
that employing a time step of t = 1 yields the most favorable outcome. Typically,
in Stable Diffusion [28], larger values of t are employed when restoring images
closer to Gaussian noise, requiring an attention map that broadly identifies the
location of the object. Conversely, smaller values of t are utilized to restore
images closer to the original image, thereby producing an attention map that
closely resembles the shape of the object in the real image.
Furthermore, we conduct ablation experiments on the cross-attention map
and internal features, which involve examining the concatenated outputs of dif-
[Figure 8: qualitative k-means results using internal features from UNet layers of size 128, 64, 32, 16 & 8, and the combination 32, 16, 8, shown alongside the input image.]

Fig. 8: Ablation study on internal feature size. Qualitative results of k-means on the internal features show that earlier layers of the UNet (128, 64) do not provide an appropriate segmentation, whereas the inner layers (32, 16, 8) produce semantically meaningful results.

Table 4: Ablation study on diffusion time step. We evaluate MaskDiffusion on the Cityscapes dataset [9] using mIoU.

t    | 1    | 10  | 100 | 500
mIoU | 17.1 | 5.5 | 2.4 | 0.9

Table 5: Ablation study on output size of the cross-attention map. We evaluate MaskDiffusion on the Cityscapes dataset [9] using mIoU.

64 | 32 | 16, 8 | mIoU
✓  | ✓  | ✓     | 12.5
-  | ✓  | ✓     | 14.5
-  | -  | ✓     | 17.1

ferent layers. Considering the structure of UNet, the outer intermediate outputs
are of larger size compared to the smaller inner intermediate outputs. Therefore,
we compare various combinations at different positions, which are detailed in
Table 5. Overall, combining the cross-attention maps with a resolution of 8 × 8
and 16 × 16 yields the best results.
Similarly, we examine the effect of combining different internal feature sizes
by applying k-means and evaluating the resulting segmentation with the unsu-
pervised mIoU. The quantitative results are summarized in Table 6, indicating
that combining feature maps of size 8, 16, and 32 clearly outperforms other size
combinations. Figure 8 presents the qualitative results of the ablation study.
Since earlier layers of the U-Net typically generate low-level features, the output
is less suitable for segmentation, whereas inner outputs provide more semantic
information.
These findings highlight the critical role that both, the time step and the
internal features, play in the segmentation performance of MaskDiffusion.
Next, to explore the limitations of the internal features f , we created rep-
resentative f using images and ground truth on the training dataset. In this
experiment, f is first created from images in the training dataset, and then the
Table 6: Ablation study on U-Net layers to be adopted as internal features. We perform k-means on various U-Net layer outputs and evaluate on the Cityscapes dataset [9] using the unsupervised mIoU.

Internal feature size: 128 | 64 | 32 | 16, 8 | Unsupervised mIoU
✓ | - | - | - | 13.8
- | ✓ | - | - | 14.8
- | - | ✓ | - | 17.2
- | - | - | ✓ | 20.3
✓ | ✓ | ✓ | ✓ | 14.8
- | ✓ | ✓ | ✓ | 14.8
- | - | ✓ | ✓ | 23.3

Table 7: Additional experiment results of MaskDiffusion. We conduct an experiment calculating the representative f with ground truth and using only the classes contained in the images as prompt input (dynamic prompts).

Method                           | Cityscapes [9] 19 classes | PascalVOC [11] 20 classes | COCO-Stuff [5] 27 classes
MaskDiffusion                    | 17.1                      | 29.9                      | 13.0
MaskDiffusion w/ ground truth    | 35.0                      | 53.7                      | 39.8
MaskDiffusion w/ dynamic prompts | 21.6                      | 87.2                      | 40.5

f and ground truth are used to create a representative f. The representative f is the mean of f over each class in the ground truth. In the same context, we also conduct
an experiment using only the classes contained in the images as prompt input.
The evaluation results are shown in Table 7.
Without any training on the ground truth, our method achieves 30-60% of the performance obtained when the representative f is calculated with ground truth. Furthermore, we observe that the benefit of dynamic prompts is strongly influenced by the number of objects per image: Cityscapes [9] and COCO-Stuff [5] contain about 10 objects per image, whereas PascalVOC [11] contains only a few, and it is on such images with few objects that dynamic prompts yield the largest mIoU increase.
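A sketch of how such ground-truth representatives could be computed is given below; the tensor layouts and the handling of label indices are our assumptions.

```python
import torch

def representatives_from_ground_truth(feats, masks, n_classes, dim=1024, eps=1e-8):
    """Class-wise mean internal feature computed from ground-truth masks (sketch).

    feats: iterable of (1024, H, W) feature tensors; masks: matching (H, W) label
    maps with values in [0, n_classes). Replaces the attention-weighted mean of
    Eq. 5 in the oracle experiment above.
    """
    sums = torch.zeros(n_classes, dim)
    counts = torch.zeros(n_classes, 1)
    for f, m in zip(feats, masks):
        flat_f = f.reshape(dim, -1).T          # (H*W, 1024) per-pixel features
        flat_m = m.reshape(-1)
        for c in flat_m.unique():
            sel = flat_f[flat_m == c]          # features of all pixels labelled c
            sums[c] += sel.sum(dim=0)
            counts[c] += sel.shape[0]
    return sums / (counts + eps)               # (n_classes, 1024) representatives
```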

5 Limitation
While MaskDiffusion achieves remarkable results, especially compared to related
approaches, our proposed method still comes with certain limitations. First, the cross-attention map is only moderately reliable at assigning the internal features to the correct class.
This can be seen in Figure 5(b), where the segmentation is clean but the class
assignment is incorrect. Second, our setting assumes that potential candidates for
the classes appearing in the images are known beforehand. While we consider this
to be outside of the scope of this paper, it would be possible to use MaskDiffusion
in conjunction with models that are able to detect the presence of objects in
images, e.g. CLIP [27], to solve this limitation.
6 Conclusion

In this study, we proposed MaskDiffusion, a novel approach for semantic segmentation that leverages Stable Diffusion and internal features to achieve superior
segmentation results. Through a series of comprehensive experiments and anal-
yses, we have demonstrated the effectiveness and versatility of MaskDiffusion
across various datasets and challenging segmentation tasks. Our results indicate
that MaskDiffusion exhibits robust performance in handling diverse categories,
including general classes and fine-grained, proper noun-based segments.
Overall, our findings highlight the potential of MaskDiffusion as a power-
ful and effective tool in the field of semantic segmentation. By leveraging the
strengths of Stable Diffusion and internal features, we have successfully demon-
strated the model’s capability to handle diverse datasets and open vocabularies,
paving the way for future advancements and applications in this critical area of
computer vision.
References
1. ISPRS: ISPRS 2D semantic labeling contest. http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html 3, 10
2. Amit, T., Shaharbany, T., Nachmani, E., Wolf, L.: Segdiff: Image segmentation
with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021) 2, 5
3. Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., Babenko, A.:
Label-efficient semantic segmentation with diffusion models. arXiv preprint
arXiv:2112.03126 (2021) 2, 5
4. Bucher, M., Vu, T.H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Ad-
vances in Neural Information Processing Systems 32 (2019) 4
5. Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context.
In: Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 1209–1218 (2018) 3, 10, 11, 12, 15
6. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin,
A.: Emerging properties in self-supervised vision transformers. In: Proceedings of
the International Conference on Computer Vision (2021) 1, 4, 11
7. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution
for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017) 3
8. Cho, J.H., Mall, U., Bala, K., Hariharan, B.: Picie: Unsupervised semantic seg-
mentation using invariance and equivariance in clustering. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021) 1, 4,
11
9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 3213–3223 (2016) 2, 10, 11, 12, 13, 14, 15
10. Ding, Z., Wang, J., Tu, Z.: Open-vocabulary universal image segmentation with
maskclip (2023) 1
11. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman,
A.: The pascal visual object classes challenge: A retrospective. International journal
of computer vision 111, 98–136 (2015) 10, 15
12. Feng, Q., Gadde, R., Liao, W., Ramon, E., Martinez, A.: Network-free, unsu-
pervised semantic segmentation with synthetic images. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23602–
23610 (2023) 5
13. Hamilton, M., Zhang, Z., Hariharan, B., Snavely, N., Freeman, W.T.: Unsupervised
semantic segmentation by distilling feature correspondences. In: International Con-
ference on Learning Representations (2022), https://openreview.net/forum?id=
SaKO6z6Hl0c 1, 4, 11
14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in
neural information processing systems 33, 6840–6851 (2020) 2, 3
15. Ji, X., Henriques, J.F., Vedaldi, A.: Invariant information clustering for unsuper-
vised image classification and segmentation. In: Proceedings of the IEEE Interna-
tional Conference on Computer Vision. pp. 9865–9874 (2019) 4, 11
16. Jiang, P., Gu, F., Wang, Y., Tu, C., Chen, B.: Difnet: Semantic segmentation by
diffusion networks. Advances in Neural Information Processing Systems 31 (2018)
5
17. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative
adversarial networks. In: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. pp. 4401–4410 (2019) 5
18. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 8110–8119 (2020) 5
19. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114 (2013) 3
20. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint
arXiv:2304.02643 (2023) 1, 4
21. Li, D., Yang, J., Kreis, K., Torralba, A., Fidler, S.: Semantic segmentation with gen-
erative models: Semi-supervised learning and strong out-of-domain generalization.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 8300–8311 (2021) 5
22. Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P.,
Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 7061–7070 (2023) 1, 2
23. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 3431–3440 (2015) 3
24. Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 7086–7096 (2022) 4
25. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm.
Advances in neural information processing systems 14 (2001) 3, 8, 9
26. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V.,
Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust
visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 1, 11
27. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning. pp.
8748–8763. PMLR (2021) 2, 3, 4, 6, 9, 15
28. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2,
3, 4, 6, 7, 9, 11, 13
29. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: Medical Image Computing and Computer-Assisted
Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, Oc-
tober 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015) 3
30. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour,
K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-
to-image diffusion models with deep language understanding. Advances in Neural
Information Processing Systems 35, 36479–36494 (2022) 2
31. Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta,
A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-
filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021) 3,
9
32. Seong, H.S., Moon, W., Lee, S., Heo, J.P.: Leveraging hidden positives for unsu-
pervised semantic segmentation (June 2023) 1, 4, 11
33. Shin, G., Xie, W., Albanie, S.: Reco: Retrieve and co-segment for zero-shot transfer.
Advances in Neural Information Processing Systems 35, 33754–33767 (2022) 4
34. Tan, W., Chen, S., Yan, B.: Diffss: Diffusion model for few-shot semantic segmen-
tation. arXiv preprint arXiv:2307.00773 (2023) 5
35. Tian, J., Aggarwal, L., Colaco, A., Kira, Z., Gonzalez-Franco, M.: Diffuse, attend,
and segment: Unsupervised zero-shot segmentation using stable diffusion. arXiv
preprint arXiv:2308.12469 (2023) 3, 4, 5, 9, 11
36. Tritrong, N., Rewatbowornwong, P., Suwajanakorn, S.: Repurposing gans for one-
shot semantic part segmentation. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. pp. 4475–4485 (2021) 5
37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro-
cessing systems 30 (2017) 3
38. Bousselham, W., Petersen, F., Ferrari, V., Kuehne, H.: Grounding everything: Emerging localization properties in vision-language transformers. arXiv preprint arXiv:2312.00878 (2023) 3, 4, 10, 11
39. Wolleb, J., Sandkühler, R., Bieder, F., Valmaggia, P., Cattin, P.C.: Diffusion mod-
els for implicit image segmentation ensembles. In: International Conference on
Medical Imaging with Deep Learning. pp. 1336–1348. PMLR (2022) 5
40. Wu, W., Zhao, Y., Shou, M.Z., Zhou, H., Shen, C.: Diffumask: Synthesizing im-
ages with pixel-level annotations for semantic segmentation using diffusion models.
arXiv preprint arXiv:2303.11681 (2023) 5
41. Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary
panoptic segmentation with text-to-image diffusion models. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2955–
2966 (2023) 1, 2, 4, 5
42. Zhang, J., Herrmann, C., Hur, J., Cabrera, L.P., Jampani, V., Sun, D., Yang, M.H.:
A tale of two features: Stable diffusion complements dino for zero-shot semantic
correspondence. arXiv preprint arXiv:2305.15347 (2023) 5
43. Zhaoa, Z., Wang, Y., Liu, K., Yang, H., Sun, Q., Qiao, H.: Semantic segmenta-
tion by improved generative adversarial networks. arXiv preprint arXiv:2104.09917
(2021) 5
44. Zhou, C., Loy, C.C., Dai, B.: Extract free dense labels from clip. In: European
Conference on Computer Vision. pp. 696–712. Springer (2022) 1, 2, 4, 10, 11
45. Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Gao, J., Lee, Y.J.: Segment everything
everywhere all at once. arXiv preprint arXiv:2304.06718 (2023) 4
