SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

Demo Page · arXiv · Hugging Face

A semantic–acoustic dual-stream speech codec that achieves state-of-the-art speech reconstruction and semantic representation across a range of bitrates.

🛠️ Environment Setup

conda create -n sac python=3.10
conda activate sac
pip install -r requirements.txt  # tested with pip 24.0

🧩 Model Checkpoints

To use SAC, you need to prepare the pretrained dependencies: the GLM-4-Voice-Tokenizer for semantic tokenization and the ERes2Net speaker encoder, which extracts speaker features during codec training. Make sure the corresponding model paths are set correctly in your configuration file (e.g., configs/xxx.yaml).

The following table lists the available SAC checkpoints:

| Model Name | Hugging Face | Sample Rate | Token Rate | BPS |
|------------|--------------|-------------|------------|-----|
| SAC | 🤗 Soul-AILab/SAC-16k-37_5Hz | 16 kHz | 37.5 Hz | 525 |
| SAC | 🤗 Soul-AILab/SAC-16k-62_5Hz | 16 kHz | 62.5 Hz | 875 |
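
If it helps, the SAC checkpoints above can be fetched programmatically; the snippet below is a minimal sketch using huggingface_hub (the GLM-4-Voice-Tokenizer and ERes2Net dependencies still have to be obtained separately and their paths set in your config):

from huggingface_hub import snapshot_download

# Fetch one of the SAC checkpoints listed above.
sac_dir = snapshot_download(repo_id="Soul-AILab/SAC-16k-37_5Hz")
print(sac_dir)  # point configs/xxx.yaml at this local path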

🎧 Inference

To perform audio reconstruction, you can use the following command:

python -m bins.infer

We also provide batch scripts for audio reconstruction, encoding, decoding, and embedding extraction under the scripts/batch directory; see the batch scripts guide for details.
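
If you would rather call the codec from Python, the outline below is a minimal sketch, not the repo's actual API: the sac import and the SAC.load, encode, and decode names are hypothetical placeholders, and bins/infer.py remains the authoritative entry point.

import torch
import torchaudio

from sac import SAC  # hypothetical import; see bins/infer.py for the real entry point

model = SAC.load("/path/to/SAC-16k-37_5Hz")  # hypothetical loader
model.eval()

wav, sr = torchaudio.load("input.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)  # checkpoints operate at 16 kHz

with torch.no_grad():
    codes = model.encode(wav)    # dual-stream semantic/acoustic tokens (assumed)
    recon = model.decode(codes)  # waveform reconstruction (assumed)

torchaudio.save("recon.wav", recon.cpu(), 16000)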

🧪 Evaluation

You can run the following command to perform evaluation:

bash scripts/eval.sh

Before running it, refer to the evaluation guide for details on dataset preparation and evaluation setup.
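
Independent of eval.sh, you can sanity-check a single reconstruction with common objective metrics; the sketch below uses the third-party pesq and pystoi packages, which are our assumptions and not necessarily the metrics eval.sh reports:

import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, sr = sf.read("input.wav")   # reference waveform
deg, _ = sf.read("recon.wav")    # reconstructed waveform
n = min(len(ref), len(deg))      # trim to a common length before scoring
ref, deg = ref[:n], deg[:n]

print("PESQ (wb):", pesq(sr, ref, deg, "wb"))  # wideband mode expects sr == 16000
print("STOI:", stoi(ref, deg, sr))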

🚀 Training

Step 1: Prepare training data

Before training, organize your dataset in JSONL format (see example/training_data.jsonl for a reference). Each entry should include the following fields; a minimal generation sketch follows the list:

  • utt — unique utterance ID (customizable)
  • wav_path — path to raw audio
  • ssl_path — path to offline-extracted Whisper features (for semantic supervision)
  • semantic_token_path — path to offline-extracted semantic tokens
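
A minimal sketch for generating such a manifest (the paths and IDs below are placeholders; adapt them to your data layout):

import json

entries = [
    {
        "utt": "utt0001",
        "wav_path": "data/wav/utt0001.wav",
        "ssl_path": "data/ssl/utt0001.pt",
        "semantic_token_path": "data/semantic/utt0001.npy",
    },
]

with open("example/training_data.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")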

To accelerate training, extract the semantic tokens and Whisper features offline before starting. Refer to the feature extraction guide for detailed instructions.
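
For orientation, one generic way to dump Whisper encoder features offline with transformers is sketched below; the Whisper variant, encoder layer, and output format are assumptions here, so follow the feature extraction guide for the recipe SAC actually uses:

import torch
import torchaudio
from transformers import WhisperFeatureExtractor, WhisperModel

name = "openai/whisper-large-v3"  # assumed variant; check the feature extraction guide
fe = WhisperFeatureExtractor.from_pretrained(name)
encoder = WhisperModel.from_pretrained(name).encoder.eval()

wav, sr = torchaudio.load("data/wav/utt0001.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = encoder(inputs.input_features).last_hidden_state  # (1, T, D)

torch.save(feats.squeeze(0), "data/ssl/utt0001.pt")  # matches ssl_path above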

Step 2: Modify configuration files

You can adjust training and DeepSpeed configurations by editing the corresponding files under configs/ (e.g., configs/xxx.yaml) and the accompanying DeepSpeed config.
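
As a point of reference, a minimal DeepSpeed configuration written from Python might look like the sketch below; the option names are standard DeepSpeed keys, but the filename and values are illustrative rather than the repo's defaults:

import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # per-GPU batch size
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},             # mixed-precision training
    "zero_optimization": {"stage": 2},     # ZeRO stage-2 optimizer sharding
}

with open("configs/ds_config.json", "w") as f:  # hypothetical filename
    json.dump(ds_config, f, indent=2)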

Step 3: Start training

Run the following script to start SAC training:

bash scripts/train.sh

🙏 Acknowledgement

Our codebase builds upon the awesome SparkVox and DAC. We thank the authors for their excellent work.

🔖 Citation

If you find this work useful in your research, please consider citing:

@misc{chen2025sacneuralspeechcodec,
      title={SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization}, 
      author={Wenxi Chen and Xinsheng Wang and Ruiqi Yan and Yushen Chen and Zhikang Niu and Ziyang Ma and Xiquan Li and Yuzhe Liang and Hanlin Wen and Shunshun Yin and Ming Tao and Xie Chen},
      year={2025},
      eprint={2510.16841},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2510.16841}, 
}

📜 License

This project is licensed under the Apache 2.0 License.