Resources:
- Audio2Face-3D Example Dataset: https://huggingface.co/datasets/nvidia/Audio2Face-3D-Dataset-v1.0.0-claire
- Maya-ACE plugin: https://github.com/NVIDIA/Maya-ACE
- Research Paper: https://arxiv.org/abs/2508.16401
Audio2Face-3D generates high-fidelity facial animations from an audio source. The technology produces detailed, realistic articulation, with precise motion for the skin, jaw, tongue, and eyes, achieving accurate lip-sync and lifelike character expression, including emotions.
Audio2Face-3D Training Framework is the core tool for training high-fidelity facial animation models within the Audio2Face-3D ecosystem. It supports both NVIDIA's prebuilt models and custom models tailored to specific characters, languages, or artistic styles. Training these models requires extensive datasets of synchronized facial animation and corresponding audio, which the framework is designed to leverage efficiently.
- Introduction
- Preparing Animation Data for Training
- Training Framework
- Configurations Guide
- Using Trained Models in Maya-ACE 2.0
- Operating System: Linux or WSL2 (Ubuntu 22.04 recommended)
- Storage: ~1 GB of free space for framework artifacts and the example dataset
- Hardware: CUDA-compatible GPU with at least 6 GB VRAM
- NVIDIA Driver: Use the following supported range:
- Linux: 575.57 - 579.x
- Windows/WSL2: 576.57 - 579.x
- Check your current version:
nvidia-smi
- Docker: Required for running the framework
- NVIDIA Docker: Required for GPU acceleration
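Before building anything, you can optionally confirm the driver and Docker GPU prerequisites listed above. This is only a sketch: the CUDA image tag below is an example (not part of this framework), and any CUDA-enabled image you already have will do.
# Print the installed NVIDIA driver version (should fall within the supported range above)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Confirm Docker can reach the GPU through the NVIDIA runtime
# (example image tag; substitute any CUDA base image available to you)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi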
This quick start guide provides a comprehensive walkthrough of the Audio2Face-3D Training Framework.
Using a sample dataset available from Hugging Face, you will learn the complete end-to-end workflow, from initial setup to testing a newly trained model.
In this guide, you will learn to:
- Set up the Training Framework environment.
- Train a new model using the sample data.
- Deploy the trained model into a usable format.
- Test the new model by running an inference.
Note: If you are not familiar with Linux and are working on a Windows system, please refer to the Detailed Setup Under Windows (WSL2 / Ubuntu) section in the Training Framework page.
Clone the Audio2Face-3D Training Framework repository:
# Create audio2face directory and navigate to it
mkdir -p ~/audio2face && cd ~/audio2face
# Clone the repository
git clone https://github.com/NVIDIA/Audio2Face-3D-Training-Framework.git
Create new directories to hold datasets and training files:
# Create datasets and workspace directories
mkdir -p ~/audio2face/datasets
mkdir -p ~/audio2face/workspace
# Navigate to the repository directory
cd ~/audio2face/Audio2Face-3D-Training-Framework
# Copy environment file template
cp .env.example .env
Edit the .env file with your actual paths (use absolute paths):
A2F_DATASETS_ROOT="/home/<username>/audio2face/datasets"
A2F_WORKSPACE_ROOT="/home/<username>/audio2face/workspace"
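As an optional sanity check (this assumes the variable names above and that .env uses plain shell-style assignments, as in the template), confirm that both directories exist:
# Load the .env values into the current shell and verify both paths resolve
source .env
ls -d "$A2F_DATASETS_ROOT" "$A2F_WORKSPACE_ROOT"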
We provide the Audio2Face-3D Example Dataset as part of this framework.
- Download the dataset:
- You can download the Claire dataset from: Claire Dataset on Hugging Face
- It needs to be placed under the A2F_DATASETS_ROOT directory as defined in the environment
- Authentication: You may need to authenticate with Hugging Face to access the dataset:
- Using Tokens: Hugging Face Tokens
- Using SSH Key: Hugging Face SSH Keys
- Clone the dataset using the following commands:
# Navigate to the datasets directory
cd ~/audio2face/datasets
# Make sure git LFS is installed
sudo apt-get install -y git-lfs
git lfs install
# Clone Claire dataset in the datasets directory using https
git clone https://huggingface.co/datasets/nvidia/Audio2Face-3D-Dataset-v1.0.0-claire
# Or alternatively clone Claire dataset in the datasets directory using SSH
git clone git@hf.co:datasets/nvidia/Audio2Face-3D-Dataset-v1.0.0-claire
- Verify the dataset structure:
- After download, your dataset directory should look like this:
/home/<username>/audio2face/datasets/
└── Audio2Face-3D-Dataset-v1.0.0-claire/
├── data/
│ └── claire/
│ ├── audio/
│ ├── cache/
│ └── ...
├── docs/
└── ...
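Optionally, confirm that Git LFS actually fetched the large files rather than leaving small pointer files behind (a common pitfall with LFS-hosted datasets):
# List a few LFS-tracked files and check that the data directory holds real content
cd ~/audio2face/datasets/Audio2Face-3D-Dataset-v1.0.0-claire
git lfs ls-files | head -n 5
du -sh data/claire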
# Navigate to the repository directory
cd ~/audio2face/Audio2Face-3D-Training-Framework
# Add executable permissions
chmod +x docker/*.sh
# Build Docker container
./docker/build_docker.sh
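If you want to check that the build finished, list your local images; the exact image name produced by build_docker.sh is not shown in this guide, but the freshly built entry should appear near the top of the listing:
# Recently created images are listed first
docker images | head -n 5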
Note: In the next steps, all python run_*.py commands automatically execute inside Docker containers with pre-configured dependencies.
Python Note: On Ubuntu, the python command may need to be python3; if python is not found, the shell prints a hint with the correct command for your installation.
# Run preprocessing with example config
python run_preproc.py example-diffusion claire
Once this process is completed, the log prints the Preproc Run Name Full (for example, 250909_135508_example).
This name is important for future steps. It needs to be added to the config_train.py file located in the configs/example-diffusion directory. In this file, you need to locate the following section:
PREPROC_RUN_NAME_FULL = {
"claire": "XXXXXX_XXXXXX_example",
}
The value needs to be updated with the name that was provided in the shell log from the preproc script. In the example above, it would be updated as follows:
PREPROC_RUN_NAME_FULL = {
"claire": "250909_135508_example",
}
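If you prefer to patch the file from the shell rather than a text editor, a one-line substitution works. The run name below is just the example value from above; substitute the one printed by your own preproc run:
# Replace the placeholder with your actual Preproc Run Name Full (example value shown)
# Run this from the repository root, like the other commands
sed -i 's/XXXXXX_XXXXXX_example/250909_135508_example/' configs/example-diffusion/config_train.py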
Note: A new sub-directory is also created in the workspace/output_preproc directory containing the artifacts of the preproc process.
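If the name has already scrolled out of your terminal, you can usually recover it from the workspace. This sketch assumes the newest entry under output_preproc is named after the run, mirroring how the training output described below is named:
# Show the most recently created preproc output directory (assumed to carry the run name)
source .env
ls -t "$A2F_WORKSPACE_ROOT/output_preproc" | head -n 1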
# Run training example
python run_train.py example-diffusion
Note: The training process can take some time (between 30 and 40 minutes depending on your hardware). The training log provides guidance on how much time is needed to complete the training.
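While training runs, you can optionally keep an eye on GPU utilization and memory from a second terminal:
# Refresh the GPU status every 10 seconds; press Ctrl+C to stop
watch -n 10 nvidia-smi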
Again, once this process is completed, a new sub-directory is created, this time in the workspace/output_train directory. The name of that directory is reflected in the shell log. You will use this name as <TRAINING_RUN_NAME_FULL> in the next step.
# Run the deploy example
python run_deploy.py example-diffusion <TRAINING_RUN_NAME_FULL>
This process creates a new sub-directory in the workspace/output_deploy directory. The name of that directory is reflected in the shell log.
This new directory contains all the files required to use the trained model for inference.
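To inspect what was produced, list the deploy output directory (the path below assumes the workspace location created earlier in this guide):
# The newest sub-directory holds the deployable model files
ls -lt ~/audio2face/workspace/output_deploy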
Once training is complete, validate your custom model using one of the following methods:
Option 1: Python Inference: Generate animations in .npy format or Maya cache (.mc) format using the built-in inference engine:
python run_inference.py example-diffusion <TRAINING_RUN_NAME_FULL>
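The exact output location is not covered in this guide; assuming the results land under the workspace like the other steps' outputs, a quick search can locate the generated files:
# Search the workspace for animation caches (.npy or Maya .mc files)
source .env
find "$A2F_WORKSPACE_ROOT" -name "*.npy" -o -name "*.mc" | head -n 10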
Option 2: Maya-ACE Integration: Deploy and test your model in a visual production environment using Maya and the Maya-ACE plugin.
The Maya-ACE plugin enables real-time visualization of animation inference. It allows you to see the output from a model directly on a character within the Autodesk Maya 3D environment, providing immediate visual feedback for testing and validation.
- Documentation: Using Trained Models in Maya-ACE 2.0
- Reference Scene:
Audio2Face-3D-Dataset-v1.0.0-claire/data/claire/geom/fullface/a2f_maya_scene.mb
If you use Audio2Face-3D Training Framework in your research, please cite:
@misc{nvidia2025audio2face3d,
title={Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars},
author={Chaeyeon Chung and Ilya Fedorov and Michael Huang and Aleksey Karmanov and Dmitry Korobchenko and Roger Ribera and Yeongho Seol},
year={2025},
eprint={2508.16401},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2508.16401},
note={Authors listed in alphabetical order}
}