Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee
Abstract:
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
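The bitrate quoted above follows from the frame rate once the number of codebooks and the codebook size are fixed. The sketch below works through that arithmetic. Only the 21.5 frames per second and the 1.89 kbps figures come from the abstract. The 8 codebooks and the 2016 codes per codebook (for example, finite scalar quantization levels 8 x 7 x 6 x 6) are assumptions chosen for illustration, consistent with the "2k codes" label used in the tables on this page, and the function name is hypothetical.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a codec that emits `num_codebooks` tokens per frame."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


FRAME_RATE_HZ = 21.5   # stated in the abstract
NUM_CODEBOOKS = 8      # assumption, not stated on this page
CODEBOOK_SIZE = 2016   # assumption: 8 * 7 * 6 * 6 FSQ levels ("2k codes")

bitrate_bps = codec_bitrate_bps(FRAME_RATE_HZ, NUM_CODEBOOKS, CODEBOOK_SIZE)
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS

print(f"{bitrate_bps / 1000:.2f} kbps")          # 1.89 kbps
print(f"{tokens_per_second:.0f} tokens/second")  # 172 tokens/second
```

Under these assumptions the result matches the 1.89 kbps stated in the abstract. The frame rate is the quantity that matters most for autoregressive models, since it sets how many decoding steps are needed per second of audio.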
Codec Reconstruction
Note that, for a fair comparison, all audio samples in this demo were resampled to 22.05 kHz.
MLS test set
English samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 10226 | 10453 | 10611 | 10839 | 12119 | 7788 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
Portuguese samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 11247 | 12287 | 12626 | 12670 | 3050 | 4405 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
French samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 2154 | 2216 | 2465 | 4482 | 5207 | 5476 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
Dutch samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 11290 | 3034 | 3798 | 4396 | 4429 | 880 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
German samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 1660 | 2252 | 3363 | 7120 | 7456 | 9494 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
Italian samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 1131 | 4009 | 428 | 6698 | 7372 | 7458 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
Polish samples
Samples | |||
---|---|---|---|
Speaker Name | 8758 | 9098 | 9860 |
Codec | |||
Ground truth | |||
Encodec 6kbps | |||
DAC 7.75kbps | |||
Spectral Codec | |||
Ours 2k codes | |||
Ours 4k codes |
Spanish samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 10667 | 3471 | 4096 | 5870 | 6306 | 7510 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
DAPS test set
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | f10 | f10 | m10 | m10 | m10 | m10 |
Codec | ||||||
Ground truth | ||||||
Encodec 6kbps | ||||||
DAC 7.75kbps | ||||||
Spectral Codec | ||||||
Ours 2k codes | ||||||
Ours 4k codes |
ZS-TTS Audio Samples
Note that, for a fair comparison, all audio samples in this demo were resampled to 22.05 kHz.
Samples | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | p249 | p260 | p264 | p268 | p269 | p270 | p277 | p292 | p294 | p295 | p297 | p301 | p310 | p314 | p316 | p318 | p341 | p347 | p363 | p376 |
Codec | ||||||||||||||||||||
Ground truth | ||||||||||||||||||||
Reference | ||||||||||||||||||||
Encodec 6kbps | ||||||||||||||||||||
DAC 7.75kbps | ||||||||||||||||||||
Spectral Codec | ||||||||||||||||||||
Ours 2k codes | ||||||||||||||||||||
Ours 4k codes |
Citation
@article{casanova2024low,
  title={Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference},
  author={Casanova, Edresson and Langman, Ryan and Neekhara, Paarth and Hussain, Shehzeen and Li, Jason and Ghosh, Subhankar and Juki{\'c}, Ante and Lee, Sang-gil},
  journal={arXiv preprint arXiv:2409.12117},
  year={2024}
}