GitHub • Paper on arXiv • Model Checkpoint • MLS 44 kHz test dataset


Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference

Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee

Abstract:

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
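
The arithmetic linking the quoted frame rate and bitrate can be sketched as below. Only the 1.89 kbps and 21.5 frames-per-second figures come from the abstract. The hop size of 1024 samples at 22.05 kHz, the 8 codebooks, and the FSQ levels [8, 7, 6, 6] (2016 codes per codebook, in line with the "2k codes" label used in the tables) are assumptions made for illustration and are not stated on this page.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a codec that emits `num_codebooks` tokens per frame."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# Assumed configuration (hypothetical values, not taken from this page).
SAMPLE_RATE_HZ = 22050        # matches the 22.05 kHz used for the demo audio
HOP_SIZE = 1024               # assumed samples per frame
NUM_CODEBOOKS = 8             # assumed number of parallel codebooks
FSQ_LEVELS = [8, 7, 6, 6]     # assumed FSQ levels per dimension

frame_rate = SAMPLE_RATE_HZ / HOP_SIZE      # about 21.5 frames per second
codebook_size = math.prod(FSQ_LEVELS)       # 2016 codes per codebook
bitrate = codec_bitrate_bps(frame_rate, NUM_CODEBOOKS, codebook_size)

print(f"frame rate: {frame_rate:.1f} fps")
print(f"codebook size: {codebook_size}")
print(f"bitrate: {bitrate / 1000:.2f} kbps")
```

Under these assumptions the result rounds to the 1.89 kbps quoted in the abstract. The same function shows why a low frame rate matters for autoregressive models: the number of tokens generated per second of audio scales linearly with `frame_rate_hz`.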

Codec Reconstruction

Note: for this audio demo, all audio was resampled to 22.05 kHz for a fair comparison.

MLS test set

English samples

Samples
Speaker Name 10226 10453 10611 10839 12119 7788
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Portuguese samples

Samples
Speaker Name 11247 12287 12626 12670 3050 4405
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

French samples

Samples
Speaker Name 2154 2216 2465 4482 5207 5476
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Dutch samples

Samples
Speaker Name 11290 3034 3798 4396 4429 880
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

German samples

Samples
Speaker Name 1660 2252 3363 7120 7456 9494
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Italian samples

Samples
Speaker Name 1131 4009 428 6698 7372 7458
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Polish samples

Samples
Speaker Name 8758 9098 9860
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Spanish samples

Samples
Speaker Name 10667 3471 4096 5870 6306 7510
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

DAPS test set

Samples
Speaker Name f10 f10 m10 m10 m10 m10
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

ZS-TTS Audio Samples

Note: for this audio demo, all audio was resampled to 22.05 kHz for a fair comparison.

Samples
Speaker Name p249 p260 p264 p268 p269 p270 p277 p292 p294 p295 p297 p301 p310 p314 p316 p318 p341 p347 p363 p376
Codec
Ground truth
Reference
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Citation


        @article{casanova2024low,
          title={Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference},
          author={Casanova, Edresson and Langman, Ryan and Neekhara, Paarth and Hussain, Shehzeen and Li, Jason and Ghosh, Subhankar and Juki{\'c}, Ante and Lee, Sang-gil},
          journal={arXiv preprint arXiv:2409.12117},
          year={2024}
        }