NanoCodec: Towards High Quality Ultra Fast Speech LLM Inference
Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg
Abstract:
Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.
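To make the frame-rate argument concrete, the sketch below counts how many autoregressive steps are needed to generate a clip at the frame rates that appear on this page. It assumes one autoregressive step per codec frame (that is, all codebooks of a frame are predicted together), which is an illustrative simplification and not a description of any specific model; the function name `autoregressive_steps` is hypothetical.

```python
import math


def autoregressive_steps(duration_s: float, fps: float) -> int:
    """Number of codec frames (and, under the one-step-per-frame
    assumption, autoregressive steps) needed to cover `duration_s` seconds."""
    return math.ceil(duration_s * fps)


if __name__ == "__main__":
    # Frame rates taken from the codecs compared on this page.
    for fps in (25.0, 21.5, 12.5, 6.25):
        steps = autoregressive_steps(10.0, fps)
        print(f"{fps:>5} FPS: {steps} steps for 10 s of audio")
```

Under this assumption, halving the frame rate halves the number of sequential decoding steps for the same audio duration.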
Codec Reconstruction
Note that all sentences were randomly selected to ensure a fair comparison.
MLS test set
English samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 10226 | 10453 | 10611 | 10839 | 12119 | 7788 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
Spanish samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 10667 | 3471 | 4096 | 5870 | 6306 | 7510 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
French samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 2154 | 2216 | 2465 | 4482 | 5207 | 5476 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
Dutch samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 11290 | 3034 | 3798 | 4396 | 4429 | 880 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
German samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 1660 | 2252 | 3363 | 7120 | 7456 | 9494 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
Italian samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 1131 | 4009 | 428 | 6698 | 7372 | 7458 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
Polish samples
Samples | |||
---|---|---|---|
Speaker Name | 8758 | 9098 | 9860 |
Codec | |||
GT | |||
1.89kbps Low Frame-rate Speech Codec | |||
1.78kbps Ours | |||
1.1kbps Mimi | |||
1.1kbps Ours | |||
1.1kbps Ours 25 FPS | |||
1.1kbps Ours 6.25 FPS | |||
0.9kbps WavTokenizer | |||
0.85kbps TS3 Codec | |||
0.8kbps Ours | |||
0.7kbps TAAE | |||
0.6kbps Ours |
Portuguese samples
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | 11247 | 12287 | 12626 | 12670 | 3050 | 4405 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
DAPS test set
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | f10 | f10 | m10 | m10 | m10 | m10 |
Codec | ||||||
GT | ||||||
1.89kbps Low Frame-rate Speech Codec | ||||||
1.78kbps Ours | ||||||
1.1kbps Mimi | ||||||
1.1kbps Ours | ||||||
1.1kbps Ours 25 FPS | ||||||
1.1kbps Ours 6.25 FPS | ||||||
0.9kbps WavTokenizer | ||||||
0.85kbps TS3 Codec | ||||||
0.8kbps Ours | ||||||
0.7kbps TAAE | ||||||
0.6kbps Ours |
Low Frame-rate Speech Codec (LFSC) Streaming Experiments
We simulated streaming inference and evaluated the LFSC model with varying numbers of lookahead frames/tokens. The output perceptual quality was assessed using PESQ, with the results presented in the table below. Our findings indicate that LFSC requires a minimum of five lookahead frames to achieve performance comparable to offline inference. However, even with seven lookahead frames, the model still does not fully match the quality of offline inference. In contrast, our models achieve the same PESQ scores in streaming mode as in offline mode without any lookahead. Additionally, we provide audio samples below to illustrate the impact of lookahead tokens on the perceptual quality of the LFSC model.

Model - Condition | PESQ |
---|---|
1.78kbps Ours 12.5 FPS - Offline | 2.955 |
1.78kbps Ours 12.5 FPS - Streaming no lookahead | 2.955 |
1.89kbps Ours 21.5 FPS - Offline | 3.087 |
1.89kbps Ours 21.5 FPS - Streaming no lookahead | 3.087 |
1.89kbps LFSC 21.5 FPS - Offline | 3.075 |
1.89kbps LFSC 21.5 FPS - Streaming 3 frames lookahead | 2.912 |
1.89kbps LFSC 21.5 FPS - Streaming 4 frames lookahead | 3.064 |
1.89kbps LFSC 21.5 FPS - Streaming 5 frames lookahead | 3.071 |
1.89kbps LFSC 21.5 FPS - Streaming 6 frames lookahead | 3.072 |
1.89kbps LFSC 21.5 FPS - Streaming 7 frames lookahead | 3.074 |
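As a rough illustration of what each lookahead setting costs, the sketch below converts lookahead frames into the amount of future audio the LFSC decoder must wait for, and pairs it with the PESQ gap to offline inference. The PESQ values are copied from the table above; the latency figure assumes each frame spans 1 / FPS seconds and ignores compute time and any other buffering, so it is a lower bound and not a measured latency. The helper names are hypothetical.

```python
# PESQ values copied from the table above (LFSC, 21.5 FPS, 1.89 kbps).
LFSC_FPS = 21.5
LFSC_OFFLINE_PESQ = 3.075
LFSC_STREAMING_PESQ = {3: 2.912, 4: 3.064, 5: 3.071, 6: 3.072, 7: 3.074}


def lookahead_latency_ms(lookahead_frames: int, fps: float) -> float:
    """Duration of future audio covered by the lookahead frames, in ms.

    Assumption: each frame spans 1 / fps seconds; compute time and other
    buffering are not included.
    """
    return 1000.0 * lookahead_frames / fps


def summarize():
    """Return (lookahead frames, lookahead in ms, PESQ gap to offline) rows."""
    rows = []
    for frames, pesq in sorted(LFSC_STREAMING_PESQ.items()):
        latency = lookahead_latency_ms(frames, LFSC_FPS)
        gap = LFSC_OFFLINE_PESQ - pesq
        rows.append((frames, round(latency, 1), round(gap, 3)))
    return rows


if __name__ == "__main__":
    for frames, latency, gap in summarize():
        print(f"{frames} frames: {latency} ms lookahead, PESQ gap {gap}")
```

Under these assumptions, five lookahead frames at 21.5 FPS correspond to roughly 233 ms of future audio, whereas the models evaluated without lookahead need none.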
DAPS test set
Samples | ||||||
---|---|---|---|---|---|---|
Speaker Name | f10 | f10 | m10 | m10 | m10 | m10 |
Codec | ||||||
GT | ||||||
1.78kbps Ours 12.5 FPS - Offline | ||||||
1.78kbps Ours 12.5 FPS - Streaming no lookahead | ||||||
1.89kbps Ours 21.5 FPS - Offline | ||||||
1.89kbps LFSC 21.5 FPS - Offline | ||||||
1.89kbps LFSC 21.5 FPS - Streaming 3 frames lookahead | ||||||
1.89kbps LFSC 21.5 FPS - Streaming 4 frames lookahead | ||||||
1.89kbps LFSC 21.5 FPS - Streaming 5 frames lookahead | ||||||
1.89kbps LFSC 21.5 FPS - Streaming 6 frames lookahead | ||||||
1.89kbps LFSC 21.5 FPS - Streaming 7 frames lookahead |
ZS-TTS Audio Samples
Samples | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
Codec | |||||||||||||||||||||
Ground truth | |||||||||||||||||||||
Reference | |||||||||||||||||||||
LFSC 21.5 FPS 1.89 kbps | |||||||||||||||||||||
NanoCodec 21.5 FPS 1.89 kbps | |||||||||||||||||||||
NanoCodec 12.5 FPS 1.78 kbps | |||||||||||||||||||||
NanoCodec 12.5 FPS 1.1 kbps | |||||||||||||||||||||
NanoCodec 12.5 FPS 1.78 kbps 10s context |