A semantic–acoustic dual-stream speech codec achieving state-of-the-art performance in speech reconstruction and semantic representation across bitrates.
```bash
conda create -n sac python=3.10
conda activate sac
pip install -r requirements.txt  # pip version == 24.0
```
To use SAC, first prepare the pretrained dependencies: the GLM-4-Voice-Tokenizer for semantic tokenization and the ERes2Net speaker encoder for speaker feature extraction (used during codec training). Make sure the corresponding model paths are set correctly in your configuration file (e.g., `configs/xxx.yaml`).
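As a quick sanity check, here is a minimal sketch that verifies the configured paths exist. The key names `glm4_voice_tokenizer_path` and `eres2net_path` are illustrative placeholders, not the repo's actual schema; use whatever keys your `configs/xxx.yaml` defines.

```python
# Sanity-check pretrained-dependency paths in the YAML config.
# NOTE: the key names below are hypothetical; replace them with the
# actual keys used in configs/xxx.yaml.
from pathlib import Path

import yaml

config = yaml.safe_load(Path("configs/xxx.yaml").read_text())

for key in ("glm4_voice_tokenizer_path", "eres2net_path"):
    path = config.get(key)
    assert path and Path(path).exists(), f"missing or invalid path for {key!r}"
print("pretrained dependency paths look good")
```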
The following table lists the available SAC checkpoints:
| Model Name | Hugging Face | Sample Rate | Token Rate | Bitrate (bps) |
|---|---|---|---|---|
| SAC | 🤗 Soul-AILab/SAC-16k-37_5Hz | 16 kHz | 37.5 Hz | 525 |
| SAC | 🤗 Soul-AILab/SAC-16k-62_5Hz | 16 kHz | 62.5 Hz | 875 |
To perform audio reconstruction, run:

```bash
python -m bins.infer
```
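If you prefer driving the codec from Python, the sketch below shows the intended encode/decode round trip. The import path and API names (`SAC.from_pretrained`, `encode`, `decode`) are assumptions for illustration only; consult `bins/infer.py` for the project's real interface.

```python
# Illustrative reconstruction round trip. The SAC class and method names
# used here are hypothetical; see bins/infer.py for the actual entry point.
import torch
import torchaudio

from sac import SAC  # hypothetical import path

wav, sr = torchaudio.load("input.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000)  # SAC checkpoints are 16 kHz

codec = SAC.from_pretrained("Soul-AILab/SAC-16k-37_5Hz").eval()
with torch.no_grad():
    tokens = codec.encode(wav)    # semantic + acoustic token streams
    recon = codec.decode(tokens)  # waveform reconstruction

torchaudio.save("recon.wav", recon.squeeze(0).cpu(), 16_000)
```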
We also provide batch scripts for audio reconstruction, encoding, decoding, and embedding extraction in the `scripts/batch` directory for reference (see the batch scripts guide for details).
You can run the following command to perform evaluation:

```bash
bash scripts/eval.sh
```
For details on dataset preparation and evaluation setup, please first refer to the evaluation guide.
Before training, organize your dataset in JSONL format, following `example/training_data.jsonl` (a minimal writer sketch appears after the field list below). Each entry should include:
- `utt` — unique utterance ID (customizable)
- `wav_path` — path to the raw audio
- `ssl_path` — path to offline-extracted Whisper features (for semantic supervision)
- `semantic_token_path` — path to offline-extracted semantic tokens
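For illustration, a minimal sketch that appends one such entry (all paths are placeholders; `example/training_data.jsonl` is the authoritative reference):

```python
# Write one illustrative training entry to a JSONL manifest.
# All paths below are placeholders.
import json

entry = {
    "utt": "utt_000001",                                   # unique utterance ID
    "wav_path": "data/wavs/utt_000001.wav",                # raw audio
    "ssl_path": "data/ssl/utt_000001.npy",                 # offline Whisper features
    "semantic_token_path": "data/tokens/utt_000001.npy",   # offline semantic tokens
}

with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```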
To accelerate training, extract the semantic tokens and Whisper features offline before starting. Refer to the feature extraction guide for detailed instructions.
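The repo's own extraction scripts are the source of truth, but as a rough sketch of offline Whisper encoder feature extraction with Hugging Face `transformers` (the model choice `openai/whisper-large-v3`, encoder layer, and file layout are all assumptions):

```python
# Rough sketch of offline Whisper encoder feature extraction.
# The actual model, layer, and file layout used by SAC may differ;
# follow the feature extraction guide.
import numpy as np
import torch
import torchaudio
from transformers import WhisperModel, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperModel.from_pretrained("openai/whisper-large-v3").eval()

wav, sr = torchaudio.load("data/wavs/utt_000001.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)  # mono, 16 kHz

inputs = processor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    feats = model.encoder(inputs.input_features).last_hidden_state  # (1, T, D)

np.save("data/ssl/utt_000001.npy", feats.squeeze(0).numpy())
```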
You can adjust the training and DeepSpeed configurations by editing:
- `configs/xxx.yaml` — main training configuration
- `configs/ds_stage2.json` — DeepSpeed configuration
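For orientation, a minimal sketch of what a ZeRO stage-2 DeepSpeed config typically contains. The values are placeholders, not the repo's shipped settings; keep whatever `configs/ds_stage2.json` already specifies.

```python
# Illustrative ZeRO stage-2 DeepSpeed config. Values are placeholders;
# defer to the repo's shipped configs/ds_stage2.json.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,                   # shard optimizer state + gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
}

with open("configs/ds_stage2.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```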
Run the following script to start SAC training:

```bash
bash scripts/train.sh
```
Our codebase builds upon the awesome SparkVox and DAC. We thank the authors for their excellent work.
If you find this work useful in your research, please consider citing:
```bibtex
@misc{chen2025sacneuralspeechcodec,
  title={SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization},
  author={Wenxi Chen and Xinsheng Wang and Ruiqi Yan and Yushen Chen and Zhikang Niu and Ziyang Ma and Xiquan Li and Yuzhe Liang and Hanlin Wen and Shunshun Yin and Ming Tao and Xie Chen},
  year={2025},
  eprint={2510.16841},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2510.16841},
}
```
This project is licensed under the Apache 2.0 License.