SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

Yen-Chi Cheng (1), Hsin-Ying Lee (2), Sergey Tulyakov (2), Alexander Schwing (1)*, Liangyan Gui (1)*
(1) University of Illinois Urbana-Champaign, (2) Snap Research
{yenchic3,aschwing,lgui}@illinois.edu {hlee5,stulyakov}@snap.com
https://yccyenchicheng.github.io/SDFusion/
arXiv:2212.04493v2 [cs.CV] 22 Mar 2023

[Figure 1 panel labels: Shape Completion (missing input), Text-guided Completion, 3D Recon., Text-guided Generation, Multi-condition (condition strength), Text-guided Colorization. Example prompts shown in the panels: "square couch with shelf", "table with net", "cheesecake", "wooden table", "table tennis table", "a rocking chair", "square table, curved legs", "ice cream", "summer house", "Japanese castle", "a chair with arms".]

Figure 1. Applications of SDFusion. The proposed diffusion-based model enables various applications. (left) SDFusion can generate
shapes conditioned on different input modalities, including partial shapes, images, and text. SDFusion can even jointly handle multiple
conditioning modalities while controlling the strength for each of them. (right) We leverage pretrained 2D models to texture 3D shapes
generated by SDFusion.

Abstract

In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes and combinations of these, further allowing to adjust the strength of each input. At the core of our approach is an encoder-decoder, compressing 3D shapes into a compact latent representation, upon which a diffusion model is learned. To enable a variety of multi-modal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism. Due to its flexibility, our model naturally supports a variety of tasks, outperforming prior works on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one swiss-army-knife tool, enabling the user to perform shape generation using incomplete shapes, images, and textual descriptions at the same time, providing the relative weights for each input and facilitating interactivity. Despite our approach being shape-only, we further show an efficient method to texture the generated shape using large-scale text-to-image models.

1. Introduction

Generating 3D assets is a cornerstone of immersive augmented/virtual reality experiences. Without realistic and diverse objects, virtual worlds will look void and engagement will remain low. Despite this need, manually creating and editing 3D assets is a notoriously difficult task, requiring creativity, 3D design skills, and access to sophisticated software with a very steep learning curve. This makes 3D asset creation inaccessible for inexperienced users. Yet, in many cases, such as interior design, users more often than not have a reasonably good understanding of what they want to create. In those cases, an image or a rough sketch is sometimes accompanied by text indicating details of the asset, which are hard to express graphically for an amateur.

Due to this need, it is not surprising that democratizing the 3D content creation process has become an active research area. Conventional 3D generative models require direct 3D supervision in the form of point clouds [2, 21], signed distance functions (SDFs) [9, 25], voxels [42, 47], etc. Recently, first efforts have been made to explore the learning of 3D geometry from multi-view supervision with known camera poses by incorporating inductive biases via neural rendering techniques [5, 6, 14, 37, 52]. While compelling results have been demonstrated, training is often very time-consuming and ignores available 3D data that can be used to obtain good shape priors. We foresee an ideal collaborative paradigm for generative methods where models trained on 3D data provide detailed and accurate geometry, while models trained on 2D data provide diverse appearances. A first proof of concept is shown in Figure 1.

In our pursuit of flexible and high-quality 3D shape generation, we introduce SDFusion, a diffusion-based generative model with a signed distance function (SDF) under the hood, acting as our 3D representation. Compared to other 3D representations, SDFs are known to represent high-resolution shapes with arbitrary topology well [9, 18, 23, 30]. However, 3D representations are infamous for demanding high computational resources, limiting most existing 3D generative models to voxel grids of $32^3$ resolution and point clouds of 2K points. To side-step this issue, we first utilize an auto-encoder to compress 3D shapes into a more compact low-dimensional representation. Because of this, SDFusion can easily scale up to a $128^3$ resolution. To learn the probability distribution over the introduced latent space, we leverage diffusion models, which have recently been used with great success in various 2D generation tasks [4, 19, 22, 26, 35, 40]. Furthermore, we adopt task-specific encoders and a cross-attention [34] mechanism to support multiple conditioning inputs, and apply classifier-free guidance [17] to enable flexible conditioning usage. Because of these strategies, SDFusion can not only use a variety of conditions from multiple modalities, but also adjust their importance weight, as shown in Figure 1. Compared to a recently proposed autoregressive model [25] that also adopts an encoded latent space, SDFusion achieves superior sample quality, while offering more flexibility to handle multiple conditions and, at the same time, features reduced memory usage. With SDFusion, we study the interplay between models trained on 2D and 3D data. Given 3D shapes generated by SDFusion, we take advantage of an off-the-shelf 2D diffusion model [34], neural rendering [24], and score distillation sampling [31] to texture the shapes given text descriptions as conditional variables.

We conduct extensive experiments on the ShapeNet [7], BuildingNet [38], and Pix3D [43] datasets. We show that SDFusion quantitatively and qualitatively outperforms prior work in shape completion, 3D reconstruction from images, and text-to-shape tasks. We further demonstrate the capability of jointly controlling the generative model via multiple conditioning modalities, the flexibility of adjusting relative weight among modalities, and the ability to texture 3D shapes given textual descriptions, as shown in Figure 1.

We summarize the main contributions as follows:
• We propose SDFusion, a diffusion-based 3D generative model which uses a signed distance function as its 3D representation and a latent space for diffusion.
• SDFusion enables conditional generation with multiple modalities, and provides flexible usage by adjusting the weight among modalities.
• We demonstrate a pipeline to synthesize textured 3D objects benefiting from an interplay between 2D and 3D generative models.

2. Related Work

3D Generative Models. Different from 2D images, it is less clear how to effectively represent 3D data. Indeed, various representations with different pros and cons have been explored, particularly when considering 3D generative models. For instance, 3D generative models have been explored for point clouds [2, 21], voxel grids [20, 42, 47], meshes [51], signed distance functions (SDFs) [9, 11, 25], etc. In this work, we aim to generate an SDF. Compared to other representations, SDFs exhibit a reasonable trade-off regarding expressivity, memory efficiency, and direct applicability to downstream tasks. Moreover, conditioning 3D generation of SDFs on different modalities further enables many applications, including shape completion, 3D reconstruction from images, 3D generation from text, etc. The proposed framework can handle these tasks in a single model which makes it different from prior work.

Recently, thanks to the advancement of neural rendering [24], a new stream of research has emerged to learn 3D generation and manipulation from only 2D supervision [1, 5, 6, 28, 37, 39, 41, 49]. We believe the interplay between the two streams of work is promising in the foreseeable future.

[Figure 2 diagram labels: Input, Encoder $E_\varphi$, latent $\mathbf{z}$, Diffusion process, $\mathbf{z}_T$, Denoising 3D UNet $\epsilon_\theta$ (attention layers, skip connections, $T-1$ × Denoise), Decoder $D_\tau$, Output; Condition, Task encoders $E_{\phi_1}$ and $E_{\phi_2}$, Dropout, Concat.; Multi-modality Conditioning with weights $(s_1, s_2)$ set to (0,0), (1,0), and (0,1); example prompts "a brick house" and "chair with three legs".]
Figure 2. SDFusion Overview. (left) To enable high-resolution generation, we first encode 3D shapes into a latent space, where a diffusion model is trained. Furthermore, to enable flexible conditional generation, we adopt task-specific encoders along with classifier-free guidance for multi-modality conditioning. (right) At inference time, we can control the importance of each conditioning modality.

Diffusion Models. Diffusion models have recently emerged as a popular family of generative models with competitive sample quality. In particular, diffusion models have shown impressive quality, diversity, and expressiveness in various tasks such as image synthesis [13, 15, 16, 27], super-resolution [36], image editing [10, 22, 40], text-to-image synthesis [4, 19, 26, 33, 35], etc. In contrast to the flourishing research on diffusion models for 2D data, diffusion models have not yet been fully explored for 3D data. Notable exceptions include attempts to apply diffusion models to point clouds [21, 53].

Differently, in this work, we apply diffusion models on SDF representations. As reasonable resolutions of SDFs are demanding to model, we study the use of a latent diffusion technique [34] and the classifier-free conditional generation mechanism [17], both of which have been shown to yield promising results when being used in 2D diffusion models.

3. Approach

We aim at synthesizing 3D shapes using diffusion models. Towards this goal, we model the distribution over 3D shapes $\mathbf{X}$, a volumetric Truncated Signed Distance Field (T-SDF). However, applying diffusion models directly on reasonably high-resolution 3D shapes is computationally very demanding. Therefore, we first compress the 3D shape into a discretized and compact latent space (Section 3.1). This allows us to apply diffusion models in a lower-dimensional space (Section 3.2). The proposed framework can further incorporate various user conditions such as partial shapes, images, and text (Section 3.3). Finally, we showcase an interplay between the proposed framework and diffusion models trained on 2D data to texture 3D shapes (Section 3.4).

3.1. 3D Shape Compression of SDF

A 3D shape representation is high-dimensional and thus difficult to model. To make learning of high-resolution 3D shape distributions via diffusion models feasible, we compress the 3D shape $\mathbf{X}$ into a lower-dimensional yet compact latent space. For this, we leverage a 3D-variant of the Vector Quantised-Variational AutoEncoder (VQ-VAE) [29]. Specifically, the employed 3D VQ-VAE contains an encoder $E_\varphi$ to encode the 3D shape into the latent space, and a decoder $D_\tau$ to decode the latent vectors back to 3D space. Given an input shape represented via the T-SDF $\mathbf{X} \in \mathbb{R}^{D \times D \times D}$, we have

\begin{equation}
\mathbf{z} = E_{\varphi}(\mathbf{X}), \quad \text{and} \quad \mathbf{X}' = D_{\tau}(\mathrm{VQ}(\mathbf{z})), \tag{1}
\end{equation}

where $\mathbf{z} \in \mathbb{R}^{d \times d \times d}$ is the latent vector, latent dimension $d$ is smaller than 3D shape dimension $D$, and VQ is the quantization step which maps the latent variable $\mathbf{z}$ to the nearest element in the codebook $\mathcal{Z}$. The encoder $E_\varphi$, decoder $D_\tau$, and codebook $\mathcal{Z}$ are jointly optimized. We pre-train the VQ-VAE with reconstruction loss, commitment loss, and VQ objective, similar to [29] using ShapeNet or BuildingNet data.

3.2. Latent Diffusion Model for SDF

Using the trained encoder $E_\varphi$, we can now encode any given SDF into a compact and low-dimensional latent variable $\mathbf{z} = E_\varphi(\mathbf{X})$. We can then train a diffusion model on this latent representation. Fundamentally, a diffusion model learns to sample from a target distribution by reversing a progressive noise diffusion process. Given a sample $\mathbf{z}$, we obtain $\mathbf{z}_t$, $t \in \{1, \dots, T\}$ by gradually adding Gaussian noise with a variance schedule. Then we use a time-conditional 3D UNet $\epsilon_\theta$ as our denoising model. To train the denoising 3D UNet, we adopt the simplified objective proposed by Ho et al. [15]:

\begin{equation}
L_{\mathrm{simple}}(\theta) := \mathbb{E}_{\mathbf{z},\, \epsilon \sim \mathcal{N}(0, 1),\, t} \left[ \left\| \epsilon - \epsilon_\theta(\mathbf{z}_t, t) \right\|^2 \right]. \tag{2}
\end{equation}

At inference time, we sample $\hat{\mathbf{z}}$ by gradually denoising a noise variable sampled from the standard normal distribution $\mathcal{N}(0, 1)$, and leverage the trained decoder $D_\tau$ to map the denoised code $\hat{\mathbf{z}}$ back to a 3D T-SDF shape representation $\hat{\mathbf{X}} = D_\tau(\hat{\mathbf{z}})$, as shown in Figure 2.
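To make the training objective of Eq. (2) concrete, the following is a minimal sketch of one Monte Carlo sample of the loss on a toy, flattened latent. It is not the training code of this work: the linear variance schedule and its endpoints (`beta_start`, `beta_end`), the closed-form noising $\mathbf{z}_t = \sqrt{\bar\alpha_t}\,\mathbf{z} + \sqrt{1-\bar\alpha_t}\,\epsilon$ from Ho et al. [15], the 0-indexed timestep, and the placeholder `denoiser` callable (standing in for the 3D UNet $\epsilon_\theta$) are all assumptions made for illustration.

```python
import math
import random


def make_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed values)."""
    alpha_bars, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(z, eps, alpha_bar_t):
    """Noised latent z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * zi + b * ei for zi, ei in zip(z, eps)]


def simple_loss(denoiser, z, alpha_bars, rng):
    """One sample of Eq. (2): || eps - eps_theta(z_t, t) ||^2."""
    t = rng.randrange(len(alpha_bars))        # uniform timestep, 0-indexed here
    eps = [rng.gauss(0.0, 1.0) for _ in z]    # eps ~ N(0, 1)
    z_t = q_sample(z, eps, alpha_bars[t])
    eps_hat = denoiser(z_t, t)                # hypothetical stand-in for the 3D UNet
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = make_alpha_bars(T=1000)
    z = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # toy flattened latent code
    zero_denoiser = lambda z_t, t: [0.0] * len(z_t)  # untrained placeholder
    print(simple_loss(zero_denoiser, z, alpha_bars, rng))
```

In practice the expectation in Eq. (2) would be approximated by averaging such samples over mini-batches of latents, and the placeholder would be replaced by a trainable network.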
3.3. Learning the Conditional Distribution

Being able to randomly sample shapes provides limited ability for interaction. Therefore, learning of a conditional distribution is essential for user applications. Importantly, multiple forms of conditional inputs are desirable such that the model can account for various kinds of scenarios. Thanks to the flexible conditional mechanism provided by a latent diffusion model [34], we can incorporate multiple conditional input modalities at once with task-specific encoders $E_\phi$ and a cross-attention module. To further allow for more flexibility in controlling the distribution, we adopt classifier-free guidance for conditional generation. The objective function reads as follows:

\begin{equation}
L(\theta, \{\phi_i\}) := \mathbb{E}_{\mathbf{z}, \mathbf{c}, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta(\mathbf{z}_t, t, F\{D \circ E_{\phi_i}(\mathbf{c}_i)\}) \right\|^2 \right], \tag{3}
\end{equation}

where $E_{\phi_i}(\mathbf{c}_i)$ is the task-specific encoder for the $i$th modality, $D$ is a dropout operation enabling classifier-free guidance, and $F$ is a feature aggregation function. In this work, $F$ refers to a simple concatenation.

At inference time, given conditions from multiple modalities, we perform classifier-free guidance as follows

\begin{equation}
\begin{aligned}
\epsilon_\theta(\mathbf{z}_t, t, F\{E_{\phi_i}(\mathbf{c}_i)~\forall i\}) = {} & \epsilon_\theta(\mathbf{z}_t, t, \boldsymbol{\emptyset}) \\
& + \sum_i s_i \left( \epsilon_\theta(\mathbf{z}_t, t, F\{E_{\phi_i}(\mathbf{c}_i), E_{\phi_j}(\mathbf{c}_j) : \mathbf{c}_j = \boldsymbol{\emptyset} ~\forall j \neq i\}) - \epsilon_\theta(\mathbf{z}_t, t, \boldsymbol{\emptyset}) \right),
\end{aligned} \tag{4}
\end{equation}

where $s_i$ denotes the weight of conditions from the $i$th modality and $\boldsymbol{\emptyset}$ denotes a condition filled with zeros. Intuitively, modalities with larger weights play more important roles in guiding the conditional generation.

In this work, we study SDFusion combined with three conditional modalities applied separately or jointly. For shape completion, given a partial observation of a shape, we perform blended diffusion similar to [4]. For single-view 3D reconstruction, we adopt CLIP [32] as the image encoder. For text-guided 3D generation, we adopt BERT [12] as the text encoder. The encoded features are then used to modulate the diffusion process with cross-attention.

[Figure 3 diagram labels: SDF $\mathbf{X}$, Densities $\sigma$, NeRF Weights $F_\Theta$, RGB rendering, $\epsilon \sim \mathcal{N}(0, 1)$, Stable Diffusion, Gradients for NeRF $\nabla_\Theta$; example prompt "wooden house with snow".]
Figure 3. 3D Shape Texturing. We demonstrate an application where models trained on 2D and 3D data are combined. The shapes generated by SDFusion are converted to a density tensor, then the color information is learned via neural rendering. The gradients are provided by an off-the-shelf 2D diffusion model [34].

3.4. 3D Shape Texturing with a 2D Model

While sampled 3D data often exhibits compellingly detailed geometry, textures of 3D data are generally more difficult to collect and even more often of limited quality. Here, we explore how to make the best use of 2D data and models, so as to aid 3D asset generation. Thanks to the recent success of neural rendering [24] and 2D text-to-image models trained on extremely large-scale data [34], using a 2D model to perform 3D synthesis is made possible with a score distillation sampling technique [31].

Here, we illustrate the procedure of texturing a 3D shape given a guiding input sentence $S$. Starting from a generated T-SDF $\mathbf{X}$, we first convert it to a density tensor $\sigma$ by using VolSDF [50]. Our goal is to learn a 5D radiance field to obtain color $F_\theta : (\mathbf{x}, \mathbf{d}) \to \mathbf{c}$, similar to the conventional NeRF [24] setting. However, different from the conventional NeRF setting, the density tensor is fixed. With a sampled camera pose, we can render an image $I$ by alpha-compositing the densities and colors along the rays for each pixel. Then, we distill the knowledge from a pre-trained stable diffusion model [34], denoted as a 2D UNet $\tilde{\epsilon}_\phi(\mathbf{z}_t, t, S)$ that predicts the noise at timestep $t$. The most straightforward way to update $F_\theta$ is to obtain and backpropagate gradients all the way from the noise prediction loss term of the stable diffusion loss function to the input. However, the UNet Jacobian term is in practice expensive to compute. Therefore, it was proposed in [31] to bypass the UNet and treat $\tilde{\epsilon}_\phi$ as a frozen critic providing scores by computing

\begin{equation}
\mathbb{E}_{t, \epsilon} \left[ \frac{\partial I(F_\theta(\cdot))}{\partial \theta} \, w(t) \left( \tilde{\epsilon}_\phi(\mathbf{z}_t, t, S) - \epsilon \right) \right], \tag{5}
\end{equation}

where $w$ is a time-dependent weight function defined in the stable diffusion model. The mechanism is called Score Distillation Sampling. Please refer to [31] for details. The score then provides an update direction to $F_\theta$. We illustrate the process in Figure 3.

4. Experiments

In this section, we conduct extensive qualitative and quantitative experiments to demonstrate the efficacy and generalizability of SDFusion. We evaluate methods on three tasks: shape completion, single-view 3D reconstruction, and text-guided generation. We then demonstrate two additional use cases: multi-conditional generation and 3D shape texturing.
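The multi-conditional use case relies on the guidance rule of Eq. (4) in Section 3.3, which can be sketched as follows. This is a minimal illustration rather than the implementation of this work: `eps_model` is a hypothetical callable standing in for the conditional 3D UNet, a dropped condition is represented by `None` instead of a zero-filled tensor, and `toy_eps_model` is an invented linear toy used only to make the arithmetic easy to check.

```python
def guided_noise(eps_model, z_t, t, conds, scales):
    """Eq. (4): eps(null) + sum_i s_i * (eps(only c_i kept) - eps(null))."""
    null = [None] * len(conds)                 # all conditions dropped
    eps_uncond = eps_model(z_t, t, null)
    out = list(eps_uncond)
    for i, (c, s) in enumerate(zip(conds, scales)):
        only_i = list(null)
        only_i[i] = c                          # keep modality i, drop all others
        eps_i = eps_model(z_t, t, only_i)
        out = [o + s * (ei - eu) for o, ei, eu in zip(out, eps_i, eps_uncond)]
    return out


def toy_eps_model(z_t, t, conds):
    """Hypothetical toy denoiser: each active condition adds a scalar shift."""
    shift = sum(c for c in conds if c is not None)
    return [0.5 * z + shift for z in z_t]


if __name__ == "__main__":
    z_t = [1.0, -2.0]
    conds = [0.3, -0.1]                        # e.g. image and text features (toy scalars)
    for scales in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(scales, guided_noise(toy_eps_model, z_t, 0, conds, scales))
```

With all weights at zero the prediction reduces to the unconditional one, and raising a single $s_i$ moves the prediction further along the direction induced by that modality alone, which is the behavior the weight grids in Figures 2 and 8 describe.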

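For the texturing use case, a single-sample estimate of the update direction in Eq. (5) of Section 3.4 can be sketched as below. The sketch assumes the renderer's Jacobian $\partial I / \partial \theta$ is available as an explicit small matrix, which is only feasible for a toy example (a real pipeline would obtain this product through automatic differentiation), and it takes the critic's noise prediction `eps_hat` as a given input instead of querying a real 2D diffusion model. The function names `sds_gradient` and `sgd_step` are hypothetical.

```python
def sds_gradient(render_jacobian, eps_hat, eps, w_t):
    """Single-sample estimate of Eq. (5): J^T [ w(t) * (eps_hat - eps) ].

    render_jacobian[p][k] holds d I_p / d theta_k for pixel p and parameter k.
    """
    residual = [w_t * (eh - e) for eh, e in zip(eps_hat, eps)]
    n_params = len(render_jacobian[0])
    return [
        sum(render_jacobian[p][k] * residual[p] for p in range(len(residual)))
        for k in range(n_params)
    ]


def sgd_step(theta, grad, lr):
    """Move the radiance-field parameters against the estimated direction."""
    return [th - lr * g for th, g in zip(theta, grad)]


if __name__ == "__main__":
    jac = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 pixels, 2 color parameters (toy)
    eps = [0.0, 0.0, 1.0]                        # noise that was actually injected
    eps_hat = [0.5, -1.0, 2.0]                   # frozen critic's prediction (given)
    grad = sds_gradient(jac, eps_hat, eps, w_t=2.0)
    print(grad, sgd_step([1.0, 1.0], grad, lr=0.1))
```

Note that the critic itself is never differentiated: only the residual between predicted and injected noise is pushed back through the renderer, which is what allows the expensive UNet Jacobian to be skipped.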
[Figure 4 panel labels: (top) Input (with missing region), Ours, AutoSDF [25], MPC [45]; (bottom) Input, Multimodal Shape Completion.]
Figure 4. Shape Completion. (Top) We compare SDFusion with AutoSDF [25] and MPC [45] on ShapeNet and BuildingNet data. SDFusion generates shapes of better quality and diversity, while being consistent with the input partial shapes. We convert the generated SDFs from SDFusion and AutoSDF to point clouds to compare with MPC. (Bottom) We present more results on diverse shape completion using various object categories.

Table 1. Quantitative comparison of Shape Completion. We evaluate methods on fidelity (UHD) and diversity (TMD) using the ShapeNet and BuildingNet data. SDFusion outperforms other methods in both metrics, especially diversity.

                ShapeNet             BuildingNet
Method          UHD ↓     TMD ↑      UHD ↓     TMD ↑
MPC [45]        0.0627    0.0303     0.1350    0.0467
AutoSDF [25]    0.0567    0.0341     0.1208    0.0649
Ours            0.0557    0.0885     0.1116    0.0745

Table 2. Quantitative evaluation of single-view reconstruction. We evaluate methods on the Pix3D dataset using Chamfer Distance and F-Score. We outperform other methods in both metrics.

Method          CD ↓      F-Score ↑
Pix2Vox [46]    3.001     0.385
ResNet2TSDF     4.582     0.351
ResNet2Voxel    4.670     0.357
AutoSDF [25]    2.267     0.415
Ours            1.852     0.432

[Figure 5 panel labels: (top) Input, GT Voxel, Ours, AutoSDF [25], ResNet2Vox, ResNet2SDF, Pix2Vox [46]; (bottom) Input and Output pairs.]
Figure 5. Single-view 3D Reconstruction. (Top) We qualitatively compare all methods on the Pix3D dataset. SDFusion generates shapes with the best visual quality. (Bottom) Here we present more reconstruction results from SDFusion.

[Figure 6 panel labels: (top) Input, Ours, AutoSDF [25], with the prompt "chair with five legs"; (bottom) prompts "a somewhat circular chair", "an L shape table", "a round table with two surfaces", "a chair with pillow for back support".]
Figure 6. Text-guided 3D shape generation. (Top) We compare SDFusion with AutoSDF on the Text2Shape dataset. SDFusion generates shapes with higher quality while conforming to the description. (Bottom) We present more diverse and interesting text-guided generation results from SDFusion.

Table 3. Quantitative evaluation of text-guided generation. We compare the proposed SDFusion method, AutoSDF, and real data using a pretrained neural evaluator on the ShapeGlot dataset. P denotes the neural evaluator preference rate for the target P(Tr) or the distractor P(Dis). If the preference is too close (≤ 0.2), we count the comparison as confused (conf.). Rows sum to 100%.

Target (Tr)     Distractor (Dis)    P(Tr)    P(Dis)    P(conf.)
Ours            AutoSDF [25]        49%      36%       15%
Ours            GT                  33%      45%       22%
AutoSDF [25]    GT                  30%      49%       21%

4.1. Shape Completion

We evaluate the shape completion task on the ShapeNet [7] and the BuildingNet [38] datasets. ShapeNet is a large-scale 3D CAD model dataset with 16 common object classes. We use the train/test splits provided by Xu et al. [48]. BuildingNet is a new large-scale 3D building model dataset. Compared to objects in ShapeNet, building models provide more geometric details and thus require higher resolution for representing the data. Hence, we use an SDF of $64^3$ resolution for ShapeNet, and a $128^3$ resolution for BuildingNet. For both datasets, we compare the completion quantitatively by providing the bottom half of the ground truth shape as input, and evaluating the completed shapes generated by different methods.

We compare SDFusion to the state-of-the-art point cloud completion method MPC [45] and the autoregressive SDF generation method AutoSDF [25]. We adopt metrics from MPC [45]. For each partial shape, we generate $k$ complete shapes. To evaluate completion fidelity, we measure the Unidirectional Hausdorff Distance (UHD) between partial shapes and generated shapes. To evaluate completion diversity, we measure the Total Mutual Difference (TMD) by computing the average Chamfer distance among $k$ generated shapes. We use $k = 10$ in the experiments.

As shown in Table 1, the proposed SDFusion performs favorably compared to all methods in the completion fidelity metric, and outperforms the baselines substantially in completion diversity. The advantages of fidelity and diversity are also apparent in Figure 4. Especially on the BuildingNet dataset, SDFusion shows its advantages in modeling high-resolution and diverse data. AutoSDF and MPC struggle to model the distribution correctly.
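The two completion metrics described in Section 4.1 can be sketched on small point sets as follows. This is an illustrative reading of the definitions, not the evaluation code behind Table 1: the Chamfer distance variant (symmetric, mean of squared nearest-neighbor distances) and the normalization of TMD (mean over unordered pairs of the $k$ shapes) are assumptions, since the text above does not pin them down, and the brute-force nearest-neighbor search is only practical for tiny inputs.

```python
import math
from itertools import combinations


def _nearest_sq(p, cloud):
    """Squared distance from point p to its nearest neighbor in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)


def uhd(partial, completed):
    """Unidirectional Hausdorff distance from the partial input to the completion."""
    return math.sqrt(max(_nearest_sq(p, completed) for p in partial))


def chamfer(a, b):
    """Symmetric Chamfer distance (mean squared nearest-neighbor distance, assumed)."""
    a_to_b = sum(_nearest_sq(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest_sq(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def tmd(shapes):
    """Average pairwise Chamfer distance among k >= 2 completions (assumed normalization)."""
    pairs = list(combinations(shapes, 2))
    return sum(chamfer(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    partial = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    completions = [
        partial + [(0.0, 1.0, 0.0)],
        partial + [(0.0, 2.0, 0.0)],
        partial + [(0.0, 3.0, 0.0)],
    ]
    print([uhd(partial, c) for c in completions], tmd(completions))
```

A low UHD indicates that the completion preserves the observed part, while a high TMD indicates that the $k$ completions differ from one another, so the two numbers are read in opposite directions.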
[Figure 7 panel labels: (left) Partial Shape + Text → Outputs, with prompts "triangular table" and "chair with many legs"; (right) Partial Shape + Image → Outputs.]
Figure 7. Conditional generation from multiple modalities. Here, we present generated samples from joint conditions of (left) partial shape and text and (right) partial shapes and images. The generated results are diverse and consistent with the provided conditions.

4.2. Single-view 3D Reconstruction

Next, we assess 3D shape reconstruction from a single image on the real-world benchmark Pix3D [43] dataset. We use the provided train/test splits on the chair category. In the absence of official splits for other categories, we randomly split the dataset into disjoint train/test splits.

We compare with the ResNet2TSDF and ResNet2Voxel baselines. Both encode images and directly output 3D shapes, in the form of a T-SDF and a voxel grid, respectively. We also compare to two state-of-the-art methods for 3D reconstruction, i.e., Pix2Vox [46] and AutoSDF [25]. We evaluate all methods after aligning resolutions to $32^3$ voxels. We use Chamfer Distance (CD) and F-score@1% [44] as evaluation metrics.

Quantitatively, as shown in Table 2, SDFusion outperforms other methods on both metrics. Qualitatively, SDFusion generates 3D shapes that are of higher visual quality and are more visually consistent with the objects shown in the images, regardless of camera poses, as shown in Figure 5.

4.3. Text-guided Generation

Next, we evaluate 3D shape generation conditioned on text input. For a qualitative comparison, we use the Text2Shape dataset [8] that provides descriptions for the ‘chair’ and ‘table’ categories in ShapeNet. For a quantitative evaluation, we adopt the ShapeGlot [3] dataset which provides text utterances describing the difference between a target shape and two distractors based on the ShapeNet dataset. We compare SDFusion with AutoSDF, which recently demonstrated state-of-the-art results on the text-guided 3D shape generation task.

We follow the evaluation pipeline proposed by ShapeGlot [3]. We train a neural evaluator to distinguish the target shape from a distractor given the description. Given two shapes from different methods, the neural evaluator provides a confidence score for each of them based on the binary classification logits. If the absolute difference between the two confidence scores is ≤ 0.2, we count the comparison as confused (conf.).

As shown in Table 3, SDFusion quantitatively outperforms AutoSDF by a large margin with a low confusion rate. SDFusion also performs better than AutoSDF when compared with ground truth data. Qualitatively, we show in Figure 6 that SDFusion not only generates shapes with better quality, but that the generated shapes are also more diverse. Notably, SDFusion reacts to very specific descriptions like “L-shaped table” and “table with two surfaces.” It generates objects of high diversity while remaining faithful to the provided description.

4.4. Multi-conditional Generation

In addition to the conditional generation tasks which take a single conditioning variable as input, we further demonstrate the efficacy of SDFusion in handling multiple modalities. First, SDFusion can jointly consider multiple conditioning modalities. On the left of Figure 7, we present the diverse generation conditional on both partial shapes and text. On the right of Figure 7, we show that given partial shapes and images, SDFusion can complete the different parts based on images. When there is ambiguity in images (e.g., rear-view of chairs), SDFusion can produce diverse predictions. Second, as shown in Figure 8, SDFusion can not only be jointly conditioned on multiple inputs, but a weight can be used to control the importance of the conditioning modalities, enabling more flexible user control. For example, for the left sample in Figure 8, the larger the weight for the input image, the more similar the results are to the shapes in the image. Similarly, the larger the weight for input text, the more “egg-shaped” the results. We envision such a fine-grained form of control to be particularly useful for interactive user applications.

4.5. 3D Shape Texturing

Finally, we showcase an application that uses SDFusion to generate 3D shapes of detailed geometry, and uses a pretrained text-to-image 2D diffusion model [34] to provide textures. As shown in Figure 9, the diffusion model pre-trained on large-scale 2D data can provide semantically
meaningful and diverse guidance to texture the 3D shapes. The model shows superior expressiveness to interpret abstract concepts (e.g., Chinese- and Halloween-style) and materials (e.g., gingerbread, cream cheese). Given a single description, the texturing pipeline can also generate diverse results, as shown in the rightmost part of Figure 9.

[Figure 8 panel labels: two examples with prompts "egg or round shape" and "with one leg"; each shows the partial shape and a grid of results for weights $(w_{\mathrm{img}}, w_{\mathrm{txt}})$ set to (0, 0), (0, 1), (1, 0), and (1, 1).]
Figure 8. Multiple conditioning variables with weight control. Given a partial shape, an image, and a sentence as conditional input, we show that the model is sensitive to weights that control the importance of different modalities.

[Figure 9 panel prompts: "Chinese temple", "gingerbread house", "a palace carved out of wood", "house with Halloween decoration", "house made of cream cheese".]
Figure 9. 3D Shape Texturing. We texture the generated 3D shapes with a 2D diffusion model trained on large-scale data. This permits generating textures from diverse textual inputs, including style and material descriptions. The pipeline can also generate diverse results given the same input description.

5. Conclusion

In this work, we present SDFusion, an attempt to adopt diffusion models to signed distance functions for 3D shape generation. To alleviate the computationally demanding nature of 3D representations, we first encode 3D shapes into an expressive low-dimensional latent space, which we use to train the diffusion model. To enable flexible conditional usage, we adopt task-specific encoders along with a cross-attention mechanism for handling conditions from multiple modalities, and leverage classifier-free guidance to facilitate weight control among modalities. Foreseeing the potential of a collaborative symbiosis between models trained on 2D and 3D data, we further demonstrate an application that takes advantage of a pretrained 2D text-to-image model to texture a generated 3D shape.

Although the results look promising and exciting, there are quite a few future directions for improvement. First, SDFusion is trained on high-quality signed distance function representations. To make the model more general and to enable the use of more diverse data, a model that operates on various 3D representations simultaneously is desirable. Another future direction is related to the diversity of the data: we currently apply SDFusion on object-centric data. It would be interesting to apply the model to more challenging scenarios (e.g., entire 3D scenes). Finally, we believe there is room to further explore how to combine models trained on 2D and 3D data.

Acknowledgements: Work supported in part by NSF under Grants 2008387, 2045586, 2106825, MRI 1725729, and NIFA award 2020-67021-32799. Thanks to NVIDIA for providing a GPU for debugging.

References

[1] Rameen Abdal, Hsin-Ying Lee, Peihao Zhu, Menglei Chai, Aliaksandr Siarohin, Peter Wonka, and Sergey Tulyakov. 3davatargan: Bridging domains for personalized editable avatars. In CVPR, 2023. 2
[2] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In ICML, 2018. 2
[3] Panos Achlioptas, Judy Fan, Robert Hawkins, Noah Goodman, and Leonidas J Guibas. Shapeglot: Learning language for shape differentiation. In CVPR, 2019. 7
[4] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022. 2, 3, 4
[5] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022. 2
[6] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR, 2021. 2
[7] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015. 2, 5
[8] Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. In ACCV, 2018. 7
[9] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In CVPR, 2019. 2
[10] Shin-I Cheng, Yu-Jie Chen, Wei-Chen Chiu, Hsin-Ying Lee, and Hung-Yu Tseng. Adaptively-realistic image generation from stroke and sketch with diffusion model. In ACCV, 2022. 3
[11] Zezhou Cheng, Menglei Chai, Jian Ren, Hsin-Ying Lee, Kyle Olszewski, Zeng Huang, Subhransu Maji, and Sergey Tulyakov. Cross-modal 3d shape generation and manipulation. In ECCV, 2022. 2
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019. 4
[13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 2021. 3
[14] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. In ICLR, 2022. 2
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020. 3
[16] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021. 3
[17] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 2, 3
[18] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In CVPR, 2020. 2
[19] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 2, 3
[20] Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, and Sergey Tulyakov. Infinicity: Infinite-scale city synthesis. arXiv preprint arXiv:2301.09637, 2023. 2
[21] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In CVPR, 2021. 2, 3
[22] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. In ICLR, 2022. 2, 3
[23] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, 2019. 2
[24] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 2, 4
[25] Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shubham Tulsiani. AutoSDF: Shape priors for 3d completion, reconstruction and generation. In CVPR, 2022. 2, 5, 6, 7
[26] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 2, 3
[27] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021. 3
[28] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR, 2021. 2
[29] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017. 3
[30] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, 2019. 2
[31] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 2, 4
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 4

[33] Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, and Leonid Sigal. Make-a-story: Visual memory conditioned consistent story generation. In CVPR, 2022. 3
[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 3, 4, 7
[35] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 2, 3
[36] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE TPAMI, 2022. 3
[37] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. In NeurIPS, 2020. 2
[38] Pratheba Selvaraju, Mohamed Nabail, Marios Loizou, Maria Maslioukova, Melinos Averkiou, Andreas Andreou, Siddhartha Chaudhuri, and Evangelos Kalogerakis. Buildingnet: Learning to label 3d buildings. In ICCV, 2021. 2, 5
[39] Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Kyle Olszewski, Jian Ren, Hsin-Ying Lee, Menglei Chai, and Sergey Tulyakov. Unsupervised volumetric animation. In CVPR, 2023. 2
[40] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-decoding models for few-shot conditional generation. NeurIPS, 2021. 2, 3
[41] Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3d generation on imagenet. In ICLR, 2023. 2
[42] Edward J Smith and David Meger. Improved adversarial systems for 3d object generation and reconstruction. In CoRL, 2017. 2
[43] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In CVPR, 2018. 2, 7
[44] Maxim Tatarchenko*, Stephan R. Richter*, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In CVPR, 2019. 7
[45] Rundi Wu, Xuelin Chen, Yixin Zhuang, and Baoquan Chen. Multimodal shape completion via conditional generative adversarial networks. In ECCV, 2020. 5, 6
[46] Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, Shangchen Zhou, and Shengping Zhang. Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In CVPR, pages 2690–2698, 2019. 5, 7
[47] Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, and Ying Nian Wu. Learning descriptor networks for 3d shape synthesis and analysis. In CVPR, 2018. 2
[48] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, 2019. 5
[49] Yinghao Xu, Menglei Chai, Zifan Shi, Sida Peng, Ivan Skorokhodov, Aliaksandr Siarohin, Ceyuan Yang, Yujun Shen, Hsin-Ying Lee, Bolei Zhou, et al. Discoscene: Spatially disentangled generative radiance fields for controllable 3d-aware scene synthesis. In CVPR, 2023. 2
[50] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. NeurIPS, 2021. 4
[51] Song-Hai Zhang, Yuan-Chen Guo, and Qing-Wen Gu. Sketch2model: View-aware 3d modeling from single free-hand sketches. In CVPR, 2021. 2
[52] X. Zhao, F. Ma, D. Güera, Z. Ren, A. G. Schwing, and A. Colburn. Generative Multiplane Images: Making a 2D GAN 3D-Aware. In ECCV, 2022. 2
[53] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In ICCV, 2021. 3
