Low Frame-rate Speech Codec

GitHub • Paper on arXiv • Model Checkpoint • MLS 44 kHz test dataset

Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference

Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee

Abstract:

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
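As a rough sanity check on the figures quoted in the abstract, the sketch below relates bitrate, frame rate, and codebook size. Only the 1.89 kbps bitrate and the 21.5 frames per second come from the text above. The codebook count (8) and codebook size (2016 entries, in line with the "2k codes" label used in the tables below) are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits spent on each codec frame at a given bitrate and frame rate."""
    return bitrate_bps / frame_rate_hz


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a codec that emits one index per codebook per frame.

    Each index carries log2(codebook_size) bits.
    """
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def frames_for_duration(frame_rate_hz: float, duration_s: float) -> float:
    """Number of codec frames a language model must process for a clip."""
    return frame_rate_hz * duration_s


if __name__ == "__main__":
    # Figures stated in the abstract.
    frame_rate = 21.5   # frames per second
    bitrate = 1890.0    # 1.89 kbps

    # Roughly 88 bits are available per frame.
    print(f"bits per frame: {bits_per_frame(bitrate, frame_rate):.1f}")

    # ASSUMPTION (not stated on this page): 8 codebooks of 2016 entries each.
    estimate = codec_bitrate_bps(frame_rate, num_codebooks=8, codebook_size=2016)
    print(f"estimated bitrate: {estimate / 1000:.2f} kbps")

    # Sequence length for a 10-second utterance at this frame rate.
    print(f"frames in 10 s: {frames_for_duration(frame_rate, 10.0):.0f}")
```

Under these assumptions the estimate lands close to the stated 1.89 kbps. The frame count shows why a low frame rate matters for autoregressive models: the sequence length grows linearly with the frame rate.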

Codec Reconstruction

Note that for this audio demo, all audio was resampled to 22.05 kHz for a fair comparison.

MLS test set

English samples

	Samples
Speaker Name	10226	10453	10611	10839	12119	7788
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Portuguese samples

	Samples
Speaker Name	11247	12287	12626	12670	3050	4405
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

French samples

	Samples
Speaker Name	2154	2216	2465	4482	5207	5476
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Dutch samples

	Samples
Speaker Name	11290	3034	3798	4396	4429	880
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

German samples

	Samples
Speaker Name	1660	2252	3363	7120	7456	9494
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Italian samples

	Samples
Speaker Name	1131	4009	428	6698	7372	7458
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Polish samples

	Samples
Speaker Name	8758	9098	9860
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Spanish samples

	Samples
Speaker Name	10667	3471	4096	5870	6306	7510
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

DAPS test set

	Samples
Speaker Name	f10	f10	m10	m10	m10	m10
Codec
Ground truth
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

ZS-TTS Audio Samples

Note that for this audio demo, all audio was resampled to 22.05 kHz for a fair comparison.

	Samples
Speaker Name	p249	p260	p264	p268	p269	p270	p277	p292	p294	p295	p297	p301	p310	p314	p316	p318	p341	p347	p363	p376
Codec
Ground truth
Reference
Encodec 6kbps
DAC 7.75kbps
Spectral Codec
Ours 2k codes
Ours 4k codes

Citation


        @article{casanova2024low,
          title={Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference},
          author={Casanova, Edresson and Langman, Ryan and Neekhara, Paarth and Hussain, Shehzeen and Li, Jason and Ghosh, Subhankar and Juki{\'c}, Ante and Lee, Sang-gil},
          journal={arXiv preprint arXiv:2409.12117},
          year={2024}
        }