ID-Animator: Zero-Shot Identity-Preserving Human Video Generation

Xuanhua He1,2∗, Quande Liu3†, Shengju Qian3, Xin Wang3, Tao Hu1,2, Ke Cao1,2, Keyu Yan1,2, Jie Zhang2†
1 University of Science and Technology of China
2 Hefei Institute of Physical Science, Chinese Academy of Sciences
3 LightSpeed Studios, Tencent
https://id-animator.github.io/

∗ Intern at Tencent. † Corresponding authors.
Preprint. Under review.

arXiv:2404.15275v3 [cs.CV] 25 Jun 2024

Figure 1: Given just one facial image, our ID-Animator is able to produce a wide range of personalized videos that not only preserve the identity of the input image but also align with the given text prompt, all within a single forward pass without further tuning.

Abstract
Generating high-fidelity human videos with specified identities has attracted significant attention in the content generation community. However, existing techniques
struggle to strike a balance between training efficiency and identity preservation,
either requiring tedious case-by-case fine-tuning or usually missing identity details
in the video generation process. In this study, we present ID-Animator, a zero-shot
human-video generation approach that can perform personalized video generation
given a single reference facial image without further training. ID-Animator inherits
existing diffusion-based video generation backbones with a face adapter to encode
the ID-relevant embeddings from learnable facial latent queries. To facilitate the
extraction of identity information in video generation, we introduce an ID-oriented
dataset construction pipeline that incorporates unified human attributes and action
captioning techniques from a constructed facial image pool. Based on this pipeline,
a random reference training strategy is further devised to precisely capture the
ID-relevant embeddings with an ID-preserving loss, thus improving the fidelity and
generalization capacity of our model for ID-specific video generation. Extensive
experiments demonstrate the superiority of ID-Animator to generate personalized
human videos over previous models. Moreover, our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community
backbone models, showing high extendability in real-world applications for video
generation where identity preservation is highly desired. Our codes and checkpoints
are released.

1 Introduction
Personalized or customized generation is to create visual content consistent in style, subject, or
character ID based on one or more reference images. In the realm of image generation, considerable
strides have been made in crafting this identity-specific content, particularly in the domain of human
image synthesis [24, 32, 40, 34]. Recently, text-driven video generation [12, 33, 36] has garnered substantial interest within the research community. These methods enable the creation of videos based on user-specified textual prompts. However, the generation of high-fidelity, identity-specific human videos remains largely unexplored. Such generation holds profound significance, particularly within the film, advertising, and game industries, where a representative character is usually required to appear. Previous approaches for customization
primarily emphasized specified postures [18], styles [27], and action sequences [39], often employing
additional control to ensure that the generated videos met user requirements. However, these methods
largely overlook the controllability of identity information. Some techniques involve model fine-tuning through methods like LoRA [17] and textual inversion [8] to achieve ID-specific control [28], but they require tedious case-by-case training for each ID and thus lack real-time inference ability. Others rely on image prompts to guide the model to feature particular subjects in the generated videos, yet they encounter challenges such as intricate dataset pipeline construction and limited ID variations [22]. Furthermore, directly integrating image customization modules [43] into the video generation model results in poor quality, such as static motion or severe frame inconsistency.
In summary, the field of ID-specified video generation currently confronts several notable challenges:

1. High training and fine-tuning costs: Many ID-specified customization methods require tedious case-by-case fine-tuning due to the lack of sufficient prior knowledge, consequently imposing significant training overhead at inference time. These training costs hinder the widespread adoption and scalability of ID-specified video generation techniques.
2. Scarcity of high-quality text-conditioned human video datasets: Unlike the image
generation community, where datasets like LAION-face [46] are readily available, the video
generation community lacks sufficient high-quality text-video data pairs, especially for
human videos. Existing datasets, such as CelebV-text [44], feature captions annotated with
fixed templates that concentrate on emotion changes while ignoring human attributes and
actions, making them unsuitable for ID-preserving video generation tasks. This scarcity
hampers research progress, leading to repetitive endeavors to collect private datasets.
3. Influence of ID-irrelevant features from reference images for video generation: The existence of ID-irrelevant features in reference images can adversely affect the quality and identity preservation of generated videos. Reducing the influence of such features poses a challenge, demanding novel solutions to ensure fidelity in ID-specified video generation.

Solutions: To address the above issues, we propose an efficient ID-specific video generation framework, named ID-Animator. With the help of the pre-trained text-to-video diffusion model and a
lightweight face adapter to encode the ID-relevant embeddings from learnable facial latent queries,
our method can largely reduce the training and fine-tuning costs for the first issue, i.e., it can complete
training within one day on a single A100 GPU and perform zero-shot inference once the training
is done. To address the second issue, we build an ID-oriented dataset construction pipeline. By
leveraging existing publicly available datasets, we introduce the concept of unified captions, which
involves generating captions for human actions, human attributes, and a unified human description. Additionally, we leverage face-relevant recognition techniques to create a corresponding reference image pool. Trained with the rewritten captions and the extracted facial image pool, our ID-Animator significantly enhances its video generation quality. In response to the third issue, we devise a novel random reference training strategy, which randomly samples faces from the face pool and optimizes the model with an ID-preserving objective to decouple ID-independent content from ID-related facial features.
Through the aforementioned designs, our model can achieve zero-shot ID-specific video generation
in a lightweight manner. It seamlessly integrates into existing community models [5], showcasing
robust generalization and ID preservation capabilities.
Our core contribution can be summarized as follows:

• We propose ID-Animator, a novel framework that can generate identity-specific videos given
any reference facial image without further model fine-tuning. It inherits pre-trained video
diffusion models with a lightweight face adapter to encode the ID-relevant embeddings
from learnable facial latent queries. To the best of our knowledge, this is the first endeavor
towards achieving zero-shot ID-specific human video generation.
• We develop an ID-oriented dataset construction pipeline to mitigate the lack of training data in personalized human video generation. Over publicly available data sources, we
present unified captioning of human videos, which extracts textual descriptions for human
attributes and actions respectively to attain comprehensive human captions. Besides, a facial
image pool is constructed over this dataset to facilitate the extraction of facial embeddings.
• Over this pipeline, we further devise a random reference training strategy optimized with
an ID-preserving objective, aiming to precisely extract the identity-relevant features and
diminish the influence of ID-irrelevant information from the given reference facial image,
thereby improving the identity fidelity and generalization ability of ID-Animator in real-
world applications for personalized human video generation.

2 Related Work
2.1 Video Generation

Video generation has been a key area of interest in research for a long time. Early endeavors
in the task utilized models like generative adversarial networks [16, 4, 20] and vector quantized
variational autoencoders to generate videos [23, 10, 29]. However, due to the limited capacity of these models, the generated videos lack motion and detail and fail to achieve good results. With the rise of the diffusion model [14], notably the latent diffusion model [30] and its success in image generation, researchers have extended the diffusion model’s applicability to video generation [15, 13, 1]. These techniques can be classified into two categories: image-to-video and text-to-video generation. The former essentially transforms a given image into a dynamic video, whereas the latter generates video by following only text instructions, without any image as input. Leading-edge methods include AnimateDiff [12], DynamiCrafter [38], ModelScope [33], AnimateAnything [6], and Stable Video Diffusion [1], among others. These techniques generally exploit pre-trained text-to-image models and intersperse them with diverse forms of temporal mixing layers.

2.2 ID-preserving Image Generation

The impressive generative abilities of diffusion models have attracted recent research endeavors
investigating their personalized generation potential. Current methods within this domain can be
divided into two categories, based on the necessity of fine-tuning during the testing phase. A subset of these methods requires fine-tuning of the diffusion model on ID-specific datasets during the testing phase, with representative techniques such as DreamBooth [31], textual inversion [8], and LoRA [17]. While these methods exhibit acceptable ID preservation abilities, they necessitate individual model training for each unique ID, thus posing a significant challenge in terms of training costs and dataset collection, which subsequently hinders their practical applicability. The latest focus of research in this domain has shifted towards training-free methods that bypass additional fine-tuning or inversion processes in the testing phase. During the inference phase, it is possible to create a high-quality ID-preserving image with just a reference image as the condition. Methods like Face0 [32] replace the final three tokens of the text embedding with a face embedding within CLIP’s feature space, utilizing this new embedding as the condition for image generation. PhotoMaker [24], on the other hand, takes
a similar approach by stacking multiple images to reduce the influence of ID-irrelevant features.
Similarly, IP-Adapter [43] decoupled reference image features and text features to facilitate cross
attention, resulting in better instruction following. Concurrently, InstantID [34] combined the features
of IP-Adapter and ControlNet [45], utilizing both global structural attributes and the fine-grained
features of reference images for the generation of ID-preserving images.

2.3 Subject-driven Video Generation

Subject-driven video generation aims at generating videos containing customized subjects. Research on subject-driven video generation is still in its early stages, with two notable works being VideoBooth [22] and MagicMe [28]. VideoBooth [22] strives to generate videos that maintain high consistency with the input subject by utilizing the subject’s CLIP feature and a latent embedding obtained through a VAE encoder. This approach offers more fine-grained information than ID-preserving generation methods; however, it is limited to subject categories present in the training data, such as cats, dogs, and vehicles, which restricts the range of applicable subjects.
MagicMe [28], on the other hand, is more closely related to the ID-preserving generation task. It
learns ID-related representations by generating unique prompt tokens for each ID. However, this
method requires separate training for each ID, making it unable to achieve zero-shot training-free
capabilities. This limitation poses a challenge for its practical application. Our proposed method
distinguishes itself from these two approaches by being applicable to any human image without
necessitating retraining during inference.

3 Method

Figure 2: An overview of our proposed framework: the ID-Animator; dataset reconstruction pipeline;
random reference training strategy with ID-preserving learning objective.

3.1 Overview

Given a reference ID image, ID-Animator endeavors to produce high-fidelity ID-specific human videos. Figure 2 illustrates our method, featuring three pivotal constituents: a dataset reconstruction pipeline, the ID-Animator framework, and the random reference training strategy employed during the training process of ID-Animator.

Figure 3: Examples of the original CelebV-caption and our human attribute caption, human action
caption and the unified human caption.

3.2 ID-Animator

As depicted at the bottom of Figure 2, our ID-Animator framework comprises two components: the backbone text-to-video model, which is compatible with diverse T2V models, and the lightweight face adapter, which is the only component subject to training, for efficiency.
Pretrained Text to Video Diffusion Model The pre-trained text-to-video diffusion model exhibits
strong video generation abilities, yet it lacks efficacy in the realm of ID-specific human video
generation. Thus, our objective is to harness the existing capabilities of the T2V model and tailor it to
the ID-specific human video generation domain. Specifically, we employ AnimateDiff [12] as our foundational T2V model.
Face Adapter The advent of image prompting has substantially bolstered the generative ability of
diffusion models, particularly when the desired content is challenging to describe precisely in text.
IP-Adapter [43] proposed a novel method, enabling image prompting capabilities on par with text
prompts, without necessitating any modification to the original diffusion model. Our approach mirrors
the decoupling of image and text features in cross-attention. This procedure can be mathematically
expressed as:
Z_new = Attention(Q, K^t, V^t) + λ · Attention(Q, K^i, V^i)    (1)

where Q, K^t, and V^t denote the query, key, and value matrices for text cross-attention, respectively, while K^i and V^i correspond to image cross-attention. Given the query features Z and the image features c_i, Q = Z W_q, K^i = c_i W_k^i, and V^i = c_i W_v^i. Only W_k^i and W_v^i are trainable weights.
Inspired by IP-Adapter, we limit our modifications to the cross-attention layer in the video generation
model, leaving the temporal attention layer unchanged to preserve the original generative capacity
of the model. A lightweight face adapter module is designed, encompassing a simple query-based image encoder and a cross-attention module with trainable cross-attention projection weights, as shown in Figure 2. The image feature c_i is derived from the CLIP feature of the reference image and is further refined by the query-based image encoder. The other weights in the cross-attention module are initialized from the original diffusion model, while the projection weights W_k^i and W_v^i are initialized using the weights of the IP-Adapter, facilitating the acquisition of preliminary image prompting capabilities and reducing the overall training costs.
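To make the decoupled cross-attention of Eq. (1) concrete, the following PyTorch sketch shows one possible single-head implementation; the module and attribute names (DecoupledCrossAttention, to_k_i, lambda_scale, and so on) are our own illustrative choices rather than the released code, and the frozen/trainable split is indicated only in comments.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Minimal single-head sketch of Eq. (1): text cross-attention plus a
    parallel image cross-attention branch (hypothetical layer names)."""

    def __init__(self, dim, lambda_scale=1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)    # shared query W_q, frozen in practice
        self.to_k_t = nn.Linear(dim, dim, bias=False)  # text key, frozen
        self.to_v_t = nn.Linear(dim, dim, bias=False)  # text value, frozen
        self.to_k_i = nn.Linear(dim, dim, bias=False)  # image key W_k^i, trainable
        self.to_v_i = nn.Linear(dim, dim, bias=False)  # image value W_v^i, trainable
        self.lambda_scale = lambda_scale
        self.scale = dim ** -0.5

    def attention(self, q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

    def forward(self, z, text_emb, face_emb):
        q = self.to_q(z)
        z_text = self.attention(q, self.to_k_t(text_emb), self.to_v_t(text_emb))
        z_face = self.attention(q, self.to_k_i(face_emb), self.to_v_i(face_emb))
        return z_text + self.lambda_scale * z_face
```

In an actual adapter, only to_k_i and to_v_i (the W_k^i and W_v^i of Eq. (1)) would be registered as trainable parameters, with the remaining projections loaded from the frozen backbone.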

3.3 ID-Oriented Human Dataset Reconstruction

Contrary to identity-preserving image generation tasks, video generation tasks currently suffer from
a lack of identity-oriented datasets. The dataset most relevant to our work is the CelebV-HQ [44]
dataset, comprising 35,666 video clips that encompass 15,653 identities and 83 manually labeled
facial attributes covering appearance, action, and emotion. However, its captions are derived from manually set templates, primarily focusing on facial appearance and human emotion while neglecting the broader environment, human actions, and detailed attributes of the video. Additionally, its caption style significantly deviates from typical user instructions, rendering it unsuitable for contemporary video
generation models. Consequently, we find it necessary to reconstruct this dataset into an identity-
oriented human dataset. Our pipeline incorporates a unified caption technique and the construction of
a face image pool.

Unified Human Video Caption Generation. We design a comprehensive restructuring of the captions within the CelebV-HQ dataset to enhance human video generation for ID-Animator. To produce high-quality human videos, it is crucial to comprehensively caption the semantic information and intricate details present within the video. Consequently, the caption must incorporate detailed attributes of the individual as well as the actions they are performing in the video. In light of this, we employ a novel rewriting technique that decouples the caption into two distinct components: human attributes and human actions. Subsequently, we leverage a language model to amalgamate these elements into a cohesive and comprehensive caption, as illustrated on the right of Figure 2.
• Human Attribute Caption. As a preliminary step, we focus on crafting an attribute caption that
aims to vividly depict the individual’s appearance and the surrounding context. To achieve this,
we employ the ShareGPT4V [2] model for caption generation. We choose the median frame of
the video as the input for ShareGPT4V. This approach allows us to generate detailed character
descriptions that incorporate a wealth of attribute information.
• Human Action Caption. Our objective is to create human videos with accurate and rich motions,
where a mere human attribute caption is insufficient for our needs. To address this requirement,
we introduce the concept of a human action caption, which strives to depict the action present
within the video. These captions are specifically designed to concentrate on the semantic content
across the entire video, facilitating a comprehensive understanding of the individual’s actions
captured therein. To achieve this goal, we leverage the Video-LLava [26] model, which has been
trained on video data and excels at focusing on the overall dynamism.
• Unified Human Caption. The limitations of relying solely on human attribute captions and human action captions are demonstrated in Figure 3. The human attribute caption fails to encompass the overall action of the individual, while the human action caption neglects the detailed characteristics of the subject. To address this, we design a unified human caption that amalgamates the benefits of both caption types, using this comprehensive caption to train our model. We employ a large language model to facilitate this integration, capitalizing on its capacity for human-like expression and for generating high-quality captions. The GPT-3.5 API is utilized in this process; a minimal sketch of this merging step is given after this list. As depicted in Figure 3, the rewritten caption effectively encapsulates the video scene, aligning more closely with human instructions.
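As a rough illustration of how the two partial captions could be merged, the sketch below builds a merging instruction and delegates the call to an abstract `call_llm` function standing in for the GPT-3.5 API; the prompt wording and function names are hypothetical, not the authors' actual prompt.

```python
def build_unified_caption_prompt(attribute_caption: str, action_caption: str) -> str:
    """Compose the instruction given to the language model (GPT-3.5 in the paper);
    the exact wording here is illustrative, not the authors' prompt."""
    return (
        "Merge the following two descriptions of the same video into one fluent, "
        "comprehensive caption. Keep the person's appearance details and the action.\n"
        f"Appearance description: {attribute_caption}\n"
        f"Action description: {action_caption}\n"
        "Unified caption:"
    )

def unify_captions(attribute_caption, action_caption, call_llm):
    """`call_llm` is a hypothetical callable wrapping the LLM API."""
    prompt = build_unified_caption_prompt(attribute_caption, action_caption)
    return call_llm(prompt).strip()
```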
Random Face Extraction for Face Pool Construction. In contrast to previous methods [43, 28, 22], our approach does not directly utilize a frame from the video as a reference image. Instead, we opt to extract the facial region from the video and use it as the reference image. Simultaneously, our technique differs from the image reconstruction training strategy employed in ID-preserving image generation works [34, 32, 24], which typically reconstruct a reference image using the same image as the condition. As depicted at the bottom of Figure 2, we shuffle the video sequence and extract facial regions from five randomly selected frames. In instances where a frame contains more than one face, it is discarded and additional frames are selected for re-extraction. The extracted facial images are subsequently stored in the face pool.
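A minimal sketch of this face-pool construction is given below. It uses an OpenCV Haar-cascade detector purely as a stand-in for the (unspecified) face detector used by the authors; frames with zero or multiple detected faces are skipped, mirroring the discard-and-reselect rule described above.

```python
import random
import cv2  # assumption: OpenCV's Haar cascade stands in for the paper's face detector

_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def build_face_pool(frames, num_samples=5):
    """Extract face crops from randomly chosen frames of one video clip.
    Frames with zero or more than one detected face are discarded and replaced."""
    pool, candidates = [], list(range(len(frames)))
    random.shuffle(candidates)
    for idx in candidates:
        if len(pool) >= num_samples:
            break
        gray = cv2.cvtColor(frames[idx], cv2.COLOR_BGR2GRAY)
        boxes = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) != 1:  # keep only single-face frames
            continue
        x, y, w, h = boxes[0]
        pool.append(frames[idx][y:y + h, x:x + w])
    return pool
```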

3.4 Random Reference Training Methods For Diminishing ID-Irrelevant Features

In the training phase of the ID-preserving diffusion model, the objective is to estimate the noise ϵ at the current time step t from a noisy latent representation z_t, incorporating conditions such as a text condition C and an image condition C_i. This noisy latent z_t is derived from the clean latent z combined with the noise component associated with the current time step t, i.e., z_t = f(z, t). This optimization procedure can be expressed by the following function:

L_diff = E_{z_t, t, C, C_i, ϵ∼N(0,1)} [ ‖ϵ − ϵ_θ(z_t, t, C, C_i)‖²_2 ]    (2)
In current identity-preserving image generation models, the image condition C_i and the reconstruction target z typically originate from the same image I. For instance, Face0 [32], InstantID [34], and FaceStudio [40] utilize image I as the target latent z, with the facial region of I serving as C_i. Conversely, Anydoor [3] and IP-Adapter directly employ the features of image I as C_i. In the image reconstruction learning phase, this approach provides an overly strong condition for the diffusion model, which not only concentrates on facial features but also encompasses extraneous features such as the background and camera angles. This may result in the neglect of domain-invariant identity features.
Random Reference Training Strategy. When the above method is directly applied to video generation tasks, this strong conditioning can cause the video content to become heavily dependent on the semantic information of the reference image rather than on its facial embedding. Character identity should exhibit domain invariance, implying that given images of the same individual from various angles and in different attire, the video generation outcomes should be similar. Therefore, drawing inspiration from the Monte Carlo concept, we design a random reference training methodology. This approach employs images that are only weakly correlated with the current video sequence as the condition C_j, effectively decoupling the generated content from the reference images. Specifically, during training, we randomly select a reference image from the previously extracted face pool, as depicted in Figure 2. By employing this Monte Carlo technique, the features from diverse reference images are averaged, reducing the influence of ID-irrelevant features. This transformation of the mapping from (C, C_i) → z to (C, C_j) → z diminishes the impact of extraneous features.
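In data-loading terms, the strategy amounts to pairing each target clip with a face crop drawn at random from that clip's pool rather than with the face of the frame being reconstructed. A minimal sketch, with illustrative names, follows.

```python
import random

def build_training_sample(video_clips, face_pools, clip_idx, num_frames=16):
    """Sketch of one training example under the random reference strategy.
    `video_clips[clip_idx]` is a list of frames; `face_pools[clip_idx]` holds the
    face crops extracted above (names are illustrative)."""
    frames = video_clips[clip_idx][:num_frames]       # reconstruction target z
    reference = random.choice(face_pools[clip_idx])   # condition C_j, weakly correlated with z
    return frames, reference
```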
Optimization with ID-Preserving Loss. In addition to the traditional diffusion loss function, we employ a reward function in our framework to promote identity-preserving learning. During the training process, we denoise the latent variables z_t, and after reaching a specific time step t̂, the denoised latent z_0 is directly predicted from z_t̂. Subsequently, we pass z_0 through the VAE decoder to obtain X_0. To measure the similarity between the reference face image and the generated content, we employ a face detection model to extract the face region from the generated content and use a face encoder to compute their similarity. This similarity is then used as a reward to update the model. Given the face region of X_0 as X_f and the condition image as C_j, the reward can be expressed as:

R_id(C_j, X_f) = cosSim(ϕ(C_j), ϕ(X_f))    (3)

where cosSim(·) represents the cosine similarity operator, and ϕ(·) denotes the face encoder. By incorporating this identity-preserving reward, our approach can achieve superior identity-preserving capabilities. The overall optimization objective can be formulated as:

L = L_diff + λ(1 − R_id)    (4)

where 1 − R_id is the identity-preserving loss and λ is the hyper-parameter to balance the two losses.
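The sketch below assembles Eq. (2)-(4) into a single PyTorch loss term; the face detector and encoder that produce the two embeddings are abstracted away, and the function signature is ours, not the released training code.

```python
import torch
import torch.nn.functional as F

def id_preserving_loss(noise, noise_pred, ref_face_emb, gen_face_emb, lam=1.0):
    """Sketch of Eq. (2)-(4): diffusion noise-prediction loss plus an identity reward.
    `ref_face_emb` and `gen_face_emb` are face-encoder embeddings of the reference
    image C_j and of the face region cropped from the decoded prediction X_0
    (the detector and encoder choices are not specified here)."""
    l_diff = F.mse_loss(noise_pred, noise)                                   # Eq. (2)
    r_id = F.cosine_similarity(ref_face_emb, gen_face_emb, dim=-1).mean()    # Eq. (3)
    return l_diff + lam * (1.0 - r_id)                                       # Eq. (4)
```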

4 Experiment
4.1 Implementation details

ID-Animator is compatible with various T2V generation backbones; here we employ the commonly-used AnimateDiff architecture [12] for experiments. Our training dataset is processed by clipping to 16 frames, center cropping, and resizing to 512×512 pixels. During training, only the parameters of the face adapter are updated, while the pre-trained text-to-video model remains frozen. Our experiments are carried out on a single NVIDIA A100 GPU (80GB) with a batch size of 2. We load the pretrained weights of IP-Adapter and set the learning rate to 5e-5 for our trainable adapter. Furthermore, to enhance generation performance using classifier-free guidance, we replace the text embedding with a null-text embedding with a probability of 20% during training. Our training process comprises two stages. In the initial stage, we employ both the identity loss and the diffusion loss to train the model, sampling only a single frame per video and setting the hyper-parameter λ to 1. In the subsequent stage, we continue training from the model obtained in the first stage, using only the diffusion loss on 16-frame videos. We train our model for 1 epoch in the first stage and for 8 epochs in the second stage.
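The 20% null-text replacement used for classifier-free guidance can be expressed as a one-line sampling step; the sketch below is illustrative and leaves the construction of the null-text embedding (e.g., the encoding of an empty prompt) to the caller.

```python
import random

def maybe_drop_text(text_embedding, null_text_embedding, drop_prob=0.2):
    """Classifier-free guidance training trick described above: with probability
    `drop_prob` the caption embedding is replaced by a null-text embedding
    (a sketch; the exact null embedding used by the authors is not specified)."""
    return null_text_embedding if random.random() < drop_prob else text_embedding
```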

4.2 Dataset and Evaluation Metrics

We utilize a subset of the CelebV-text dataset as our training dataset, comprising 15k videos, and
construct our identity-oriented dataset based on this foundation. Following the filtering of videos
containing multiple faces, the final dataset employed for training contains 13k videos. To assess
the performance of our methods, we opt for a comparative study with the IP-Adapter [43] and
AnimateDiff. To demonstrate the generalization ability of our method, we utilize the Unsplash50 [9] dataset as the test set and use prompts generated by GPT-4. We report the CLIP-I score, which measures the facial structural similarity between the reference image and the generated video; the Dover score [35], representing the overall quality of the generated video from both technical and aesthetic perspectives; the Motion Score [25], assessing the diversity and variability of motion; the Dynamic Degree [21], indicating the probability of generating motion-rich videos; and the Face Similarity [7], which measures facial feature similarity. We employ a CLIP encoder pretrained on the LAION-Face dataset, which is skilled at capturing facial structures. Given that CLIP and the face encoder are trained on images, we calculate CLIP-I and face similarity for each frame and report the average score. The first three metrics evaluate video quality, while the latter two evaluate identity-preserving ability.
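Since both CLIP-I and Face Similarity are image-level metrics, they are computed per frame and averaged over the video, as in the sketch below; `encoder` is a placeholder for either the LAION-Face-pretrained CLIP image encoder or the ArcFace encoder, and the tensor handling is simplified.

```python
import torch
import torch.nn.functional as F

def frame_averaged_similarity(ref_image, frames, encoder):
    """Per-frame identity metric: encode the reference image and every generated
    frame with the same image-level encoder, then average the per-frame cosine
    similarities. `encoder` maps a (1, C, H, W) tensor to an embedding vector."""
    ref_emb = encoder(ref_image.unsqueeze(0))
    sims = []
    for frame in frames:
        frame_emb = encoder(frame.unsqueeze(0))
        sims.append(F.cosine_similarity(ref_emb, frame_emb, dim=-1))
    return torch.stack(sims).mean().item()
```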

Table 1: Quantitative comparison with the state-of-the-art methods. The best results are highlighted in bold, while the second-best results are underlined.

                                   Video Quality                                    Identity Preservation
Method                             Dover Score↑   Motion Score↑   Dynamic Degree↑   CLIP-I↑   Face Similarity↑
AnimateDiff [12]                   0.644          6.341           0.380             -         -
IP Adapter Plus Face [42]          0.645          3.799           0.197             0.805     0.266
IP Adapter Face-ID Portrait [41]   0.588          5.480           0.217             0.587     0.411
Ours                               0.714          8.008           0.539             0.850     0.316

4.3 Quantitative Comparison

The results of the quantitative experiment are presented in Table 1. Our method surpasses the state-of-the-art methods across four metrics: CLIP-I, Dover Score, Motion Score, and Dynamic Degree, achieving superior results. IP Adapter Face-ID Portrait employs ArcFace face embeddings instead of CLIP embeddings, which enhances facial feature similarity but hinders facial structural similarity. A direct application of the IP-Adapter to video models results in diminished video quality and relatively static motion, as evidenced by the Dover Score, Motion Score, and Dynamic Degree. Additionally, our method also surpasses AnimateDiff on human-related prompts, demonstrating its superior human-related video generation ability.

Figure 4: Comparison between our methods and previous methods on three ordinary individuals.

4.4 Qualitative Comparison

We choose three images of ordinary individuals as test cases, with the images sampled from the unseen testing set. We randomly generate three prompts with an LLM, maintaining consistency with human language style. As depicted in Figure 4, our results better preserve the identity information of the given reference image than other methods, whether for men, women, or children, showing the identity fidelity and generalization capacity of our method. In contrast, the face generated by IP-Adapter-Plus-Face shows a certain level of deformation, whereas the IP-Adapter-FaceID-Portrait model is deficient in facial structural information, resulting in diminished similarity between the generated outputs and the reference image.

Figure 5: From top to bottom, our model showcases its ability to recontextualize various elements of a reference image, including human hair, clothing, background, actions, age, and gender.

4.5 Applications

In this section, we showcase the potential applications of our model, encompassing recontextualization, alteration of age or gender, ID mixing, and integration with ControlNet or community models [5] to generate highly customized videos.

4.5.1 Recontextualization

Given a reference image, our model is capable of generating identity-faithful videos while changing contextual information. The contextual information of characters can be tailored through text, encompassing attributes such as facial features, hair, and clothing, creating novel character backgrounds, and enabling the characters to execute specific actions. As illustrated in Figure 5, we supply reference images and text, and the outcomes exhibit the robust editing and instruction-following capacities of our model. From top to bottom in the figure, we exhibit the model’s proficiency in altering character hair, clothes, and background, executing particular actions, and changing age or gender.

4.5.2 Identity Mixing

Figure 6: The figure illustrates our model’s capability to blend distinct identities and create identity-specific videos.

The potential of our model to amalgamate different IDs is showcased in Figure 6. Through the blending of embeddings from two distinct IDs in varying proportions, we effectively combine features from both IDs in the generated video. This experiment substantiates the proficiency of our face adapter in learning facial representations.
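Identity mixing reduces to a convex combination of the adapter's face embeddings before they enter the image cross-attention branch; a minimal sketch with an assumed mixing weight `alpha` is shown below.

```python
import torch

def mix_identity_embeddings(face_emb_a, face_emb_b, alpha=0.5):
    """Identity mixing as described above: linearly blend the face embeddings of
    two reference identities before the image cross-attention branch.
    `alpha` controls the mixing proportion (illustrative choice)."""
    return alpha * face_emb_a + (1.0 - alpha) * face_emb_b
```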

4.5.3 Combination with ControlNet

Furthermore, our model demonstrates excellent compatibility with existing fine-grained condition
modules, such as ControlNet [45]. We opted for SparseControlNet [11], trained for AnimateDiff,
as an additional condition to integrate with our model. As illustrated in Figure 7, we can supply
either single frame control images or multi-frame control images. When a single frame control image
is provided, the generated result adeptly fuses the control image with the face reference image. In
cases where multiple control images are presented, the generated video sequence closely adheres to
the sequence provided by the multiple images. This experiment highlights the robust generalization
capabilities of our method, which can be seamlessly integrated with existing models.

4.5.4 Inference with Community Models

We assess the performance of our model using community models from Civitai, and our model continues to function effectively with these weights, despite never having been trained on them. The selected models include Lyriel and Raemuxi. As depicted in Figure 8, the first row presents the results obtained with the Lyriel model, while the second row showcases the outcomes achieved using the Raemuxi model. Our method consistently exhibits reliable facial preservation and motion generation capabilities.

Figure 7: Our model can combine with ControlNet to generate ID-specific videos.

Figure 8: From the top to bottom, we visualize the inference results with Lyriel and Raemuxi model
weights.

4.6 Ablation Experiment

To investigate the efficacy of our proposed method, we carry out ablation experiments for ID-Animator. We first study the effectiveness of the unified captioning technique, the random reference training strategy, and the ID-preserving objective. In addition, we also demonstrate the recontextualization and identity mixing abilities of our model. Following the literature [19], we randomly sample a subset from the Unsplash50 dataset for the ablation study due to limited computational resources.
On the Effectiveness of ID Preserving Loss: In the first part of the ablation experiments, we remove the ID-preserving objective. As demonstrated in Table 3, the removal of the ID-preserving loss leads to a decrease in the model’s facial feature similarity and CLIP-I metrics, validating the effectiveness of the ID-preserving objective. Furthermore, the motion score and dynamic degree suffer a notable degradation.
On the Effectiveness of Random Reference Training: In the second set of ablation experiments, we remove the random reference training strategy and instead use the facial region of the first frame of the video as a fixed reference image. As shown in the row for configuration (II) in Table 3, upon removal of the random reference training strategy, we observe a decline in video quality and in the CLIP-I metric. Without the random reference training strategy, the model is inevitably confused by irrelevant semantic information in the reference image while neglecting the identity-relevant embeddings, leading to inferior performance on generated human videos.
On the Effectiveness of Unified Captioning on Human Videos: In the third ablation study, we use the original captions from the CelebV dataset for model training. Due to the limited diversity of the available captions, we notice a significant decrease in both the Motion and Dover scores of the generated videos. Without the newly provided unified captions, the model struggles to produce motion-enriched videos given human-like text inputs, resulting in poor video quality.

4.7 Prompt List for Evaluation

To assess our model’s video generation capabilities, we follow the previous method [37], employing four categories of prompts: Accessory, Style, Context, and Action. Distinct from prior methods, we leverage GPT-4 to rewrite these prompts, enhancing their complexity and aligning them with human language styles. We utilize a total of 50 prompts, with the Unsplash50 dataset comprising 50 unique IDs. For each ID, we generate 50 videos using the 50 prompts, resulting in a total of 2,500 videos for each method. The prompt content, displayed in Table 2, showcases the diversity and human-like style.
Table 2: The prompts used in the evaluation procedure of the adapter model. These prompts are rewritten by GPT-4 and are highly aligned with human style.

Accessory:
• A {class_token} donning a red hat greets others with a friendly wave.
• A {class_token} in a Santa hat enjoys a warm cup of hot cocoa during the holiday season.
• A {class_token} wrapped in a vibrant rainbow scarf snaps a cheerful selfie.
• A sophisticated {class_token} wearing a black top hat and monocle peruses the daily newspaper.
• A culinary {class_token} dressed in a chef’s attire expertly seasons a delicious dish.
• A brave {class_token} in a firefighter uniform takes charge of a hose to combat a blaze.
• A vigilant {class_token} in a police outfit communicates with colleagues over a radio.
• A stylish {class_token} sporting pink glasses browses the latest trends on a tablet.
• A creative {class_token} wearing a bright yellow shirt skillfully paints a masterpiece.
• A mystical {class_token} adorned in a purple wizard outfit casts a spell with a flourish.

Style:
• A thought-provoking painting of a {class_token} in Banksy’s street art style, cleverly spray painting a wall.
• A {class_token} captured in the vivid style of Vincent Van Gogh, gently picking flowers from a field.
• A lively graffiti painting of a {class_token} passionately strumming a guitar.
• A serene watercolor painting of a {class_token} gracefully holding an umbrella in the rain.
• A Greek marble sculpture of a {class_token} reflecting on their own beauty.
• A captivating street art mural of a {class_token} taking a photo of a bustling city scene.
• A nostalgic black and white photograph of a {class_token} lighting a cigarette in a quiet moment.
• A pointillism painting of a {class_token} playfully interacting with a delicate butterfly.
• A traditional Japanese woodblock print of a {class_token} pouring tea with elegance.
• A bold street art stencil of a {class_token} writing a powerful message for all to see.

Context:
• A curious {class_token} in the heart of the jungle examines a map to navigate the dense foliage.
• An adventurous {class_token} in the snow constructs a snowman with joy and laughter.
• A sun-loving {class_token} on the beach diligently applies sunscreen to protect their skin.
• A {class_token} strolls along a cobblestone street, sketching the charming surroundings.
• A focused {class_token} sits on pink fabric, skillfully sewing a button onto a garment.
• A patient {class_token} on a wooden floor assembles a puzzle, piece by piece.
• A busy {class_token} with a cityscape in the background hails a taxi to their next destination.
• A nature-loving {class_token} with a majestic mountain in the background takes a deep, refreshing breath.
• A {class_token} with a quaint blue house in the background tends to their vibrant garden.
• A reflective {class_token} on a purple rug in a serene forest writes their thoughts in a journal.

Action:
• A confident {class_token} riding a horse adjusts their hat while maintaining control.
• A sociable {class_token} holding a glass of wine raises a toast to celebrate with friends.
• A birthday-celebrating {class_token} holds a piece of cake and blows out the candles with a wish.
• An intellectual {class_token} giving a lecture adjusts their glasses for a clearer view.
• A studious {class_token} reading a book turns a page, eager to continue the story.
• A green-thumbed {class_token} tends to their backyard garden, pruning plants with care.
• A {class_token} cooking a meal stirs a pot, ensuring the flavors meld together perfectly.
• A determined {class_token} at the gym wipes their brow after an intense workout.
• A responsible {class_token} walks their dog, holding the leash to ensure their pet’s safety.
• A {class_token} baking cookies takes a taste of the dough, ensuring it’s just right.

Table 3: Ablation experiment results on the Unsplash50 dataset. The best results are highlighted in bold, while the second-best results are underlined.

Config   ID Loss   Random Reference Training   Unified Captioning   Dover Score↑   Motion Score↑   Dynamic Degree↑   CLIP-I↑   Face Similarity↑
(I)      x         ✓                           ✓                    0.730          5.694           0.324             0.750     0.292
(II)     ✓         x                           ✓                    0.700          5.318           0.182             0.760     0.321
(III)    ✓         ✓                           x                    0.679          4.158           0.158             0.740     0.300
Ours     ✓         ✓                           ✓                    0.739          7.797           0.507             0.768     0.315

Recontextualization and Identity Mixing Capacity: Apart from the traditional video generation ability, we further validate the recontextualization and identity mixing abilities of our method. As illustrated in Figure 9, we observe that our method is able to manipulate the facial attributes (e.g., hair style and color) of a given identity via the text condition, showing its recontextualization capacity. Moreover, when given multiple facial images from different persons as input, our model is further able to generate videos that seamlessly mix these individual facial identities. These results show the generalization capacity and extendability of our method in real-world applications.
Visual Results of Ablation Experiment: We showcase additional experimental results, encompassing visualizations of the ablation studies and comparisons with the IP-Adapters, as depicted in Figure 10. In the left column of the figure, it is evident that IP-Adapter Plus Face produces an excessively small face, leading to a blurry and unrecognizable generated output. Meanwhile, IP-Adapter FaceID Portrait lacks the necessary facial structural information, resulting in a substantial discrepancy between the generated output and a real person’s appearance. When contrasted with the ablation study, we observe that the absence of the ID loss contributes to a decline in facial similarity, while the lack of random reference training causes the generated outputs to be nearly static. From the middle column of the image, it becomes apparent that faces generated by other methods appear bloated, consequently compromising the original facial structure information. The third column of the image presents analogous results. These visual results demonstrate the effectiveness of our method, which combines the random reference training method, the dataset reconstruction pipeline, and the ID-preserving loss function.

Figure 9: Demonstration of the recontextualization and identity mixing ability of our methods.

Figure 10: The figure demonstrates the effectiveness of our proposed methods.

5 Conclusion

In this research, our primary goal is to achieve ID-specific content generation in text-to-video (T2V) models. To this end, we introduce the ID-Animator framework to drive T2V models in generating ID-specific human videos from ID images. We facilitate the training of ID-Animator by constructing an ID-oriented dataset based on publicly available resources, incorporating unified caption generation and face pool construction. Moreover, we develop a random reference training method to minimize ID-irrelevant content in reference images and utilize an ID-preserving loss function to encourage ID preservation learning, thereby directing the adapter’s focus towards ID-related features. Our extensive experiments demonstrate that ID-Animator generates stable videos with superior ID fidelity compared to previous models.

References
[1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do-
minik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion:
Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[2] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua
Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint
arXiv:2311.12793, 2023.
[3] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor:
Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023.
[4] Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixé, and Nils Thuerey. Learning temporal
coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics
(TOG), 39(4):75–1, 2020.
[5] Civitai. Civitai. https://civitai.com/. Accessed: April 21, 2024.
[6] Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang.
Animateanything: Fine-grained open domain image animation with motion guidance. arXiv
e-prints, pages arXiv–2311, 2023.
[7] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular
margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 4690–4699, 2019.
[8] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and
Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using
textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[9] Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H Bermano, Gal Chechik, and
Daniel Cohen-Or. Lcm-lookahead for encoder-based text-to-image personalization. arXiv
preprint arXiv:2404.03620, 2024.
[10] Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and
Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer.
In European Conference on Computer Vision, pages 102–118. Springer, 2022.
[11] Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl:
Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933,
2023.
[12] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Ani-
matediff: Animate your personalized text-to-image diffusion models without specific tuning.
arXiv preprint arXiv:2307.04725, 2023.
[13] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko,
Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High
definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances
in neural information processing systems, 33:6840–6851, 2020.
[15] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and
David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems,
35:8633–8646, 2022.
[16] Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial
network for talking head video generation. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 3397–3406, 2022.
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv
preprint arXiv:2106.09685, 2021.

[18] Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone:
Consistent and controllable image-to-video synthesis for character animation. arXiv preprint
arXiv:2311.17117, 2023.
[19] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion
models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
[20] Yaosi Hu, Chong Luo, and Zhenzhong Chen. Make it move: controllable image-to-video
generation with text descriptions. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 18219–18228, 2022.
[21] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang,
Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark
suite for video generative models. arXiv preprint arXiv:2311.17982, 2023.
[22] Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change
Loy, and Ziwei Liu. Videobooth: Diffusion-based video generation with image prompts. arXiv
preprint arXiv:2312.00777, 2023.
[23] Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, and Ziwei Liu.
Text2performer: Text-driven human video generation. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision, pages 22747–22757, 2023.
[24] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan.
Photomaker: Customizing realistic human photos via stacked id embedding. arXiv preprint
arXiv:2312.04461, 2023.
[25] Zhi Li, Christos Bampis, Julie Novak, Anne Aaron, Kyle Swanson, Anush Moorthy, and
JD Cock. Vmaf: The journey continues. Netflix Technology Blog, 25(1), 2018.
[26] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united
visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
[27] Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Xintao Wang, Yujiu Yang,
and Ying Shan. Stylecrafter: Enhancing stylized text-to-video generation with style adapter.
arXiv preprint arXiv:2312.00330, 2023.
[28] Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong,
Kurt Keutzer, and Jiashi Feng. Magic-me: Identity-specific video customized diffusion. arXiv
preprint arXiv:2402.09368, 2024.
[29] Shengju Qian, Huiwen Chang, Yuanzhen Li, Zizhao Zhang, Jiaya Jia, and Han Zhang.
Strait: Non-autoregressive generation with stratified image transformer. arXiv preprint
arXiv:2303.00750, 2023.
[30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[31] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman.
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
22500–22510, 2023.
[32] Dani Valevski, Danny Lumen, Yossi Matias, and Yaniv Leviathan. Face0: Instantaneously
conditioning a text-to-image model on a face. In SIGGRAPH Asia 2023 Conference Papers,
pages 1–10, 2023.
[33] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang.
Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
[34] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot
identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
[35] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun,
Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from
aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 20144–20154, 2023.

[36] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne
Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image
diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7623–7633, 2023.
[37] Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcom-
poser: Tuning-free multi-subject image generation with localized attention. arXiv preprint
arXiv:2305.10431, 2023.
[38] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and
Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv
preprint arXiv:2310.12190, 2023.
[39] Zhe Xu, Kun Wei, Xu Yang, and Cheng Deng. Do you guys want to dance: Zero-shot
compositional human dance generation with multiple persons. arXiv preprint arXiv:2401.13363,
2024.
[40] Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, and Bin
Fu. Facestudio: Put your face everywhere in seconds. arXiv preprint arXiv:2312.02663, 2023.
[41] Hu Ye. IP-Adapter FaceID Portrait V11 SD15. https://huggingface.co/h94/
IP-Adapter-FaceID/blob/main/ip-adapter-faceid-portrait-v11_sd15.bin,
2024. Accessed on: 2024-04-19.
[42] Hu Ye. IP-Adapter Plus Face. https://huggingface.co/h94/IP-Adapter/blob/main/
models/ip-adapter-plus-face_sd15.bin, 2024. Accessed on: 2024-04-19.
[43] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image
prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
[44] Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu. Celebv-
text: A large-scale facial text-video dataset. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 14805–14814, 2023.
[45] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image
diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 3836–3847, 2023.
[46] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan,
Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-
linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 18697–18709, 2022.
