NanoCodec

GitHub • Paper on arXiv • Model Checkpoint

NanoCodec: Towards High Quality Ultra Fast Speech LLM Inference

Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg

Abstract:

Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.
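The link between frame rate and autoregressive cost can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes one autoregressive step per codec frame (all codebooks of a frame predicted together) and 1 kbps = 1000 bits/s. The function names are hypothetical and are not part of the NanoCodec release.

```python
import math


def autoregressive_steps(duration_s: float, frame_rate_fps: float) -> int:
    """Frames needed to cover `duration_s` seconds of audio.

    Assumption: one autoregressive step per frame.
    """
    return math.ceil(duration_s * frame_rate_fps)


def bits_per_frame(bitrate_kbps: float, frame_rate_fps: float) -> float:
    """Bit budget carried by a single frame (assumes 1 kbps = 1000 bits/s)."""
    return bitrate_kbps * 1000.0 / frame_rate_fps


if __name__ == "__main__":
    # Frame rates that appear in the sample tables on this page.
    for fps in (6.25, 12.5, 21.5, 25.0):
        steps = autoregressive_steps(10.0, fps)
        print(f"{fps:>5} FPS: {steps} steps for 10 s of audio")

    # Two of the bitrate / frame-rate pairs listed on this page.
    for kbps, fps in ((1.78, 12.5), (1.89, 21.5)):
        print(f"{kbps} kbps at {fps} FPS: {bits_per_frame(kbps, fps):.1f} bits per frame")
```

Under these assumptions, lowering the frame rate at a similar bitrate means fewer steps per second of audio but more bits to be carried by each frame.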

Codec Reconstruction

Note that all sentences were randomly selected to ensure a fair comparison.

MLS test set

English samples

	Samples
Speaker Name	10226	10453	10611	10839	12119	7788
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

Spanish samples

	Samples
Speaker Name	10667	3471	4096	5870	6306	7510
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

French samples

	Samples
Speaker Name	2154	2216	2465	4482	5207	5476
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

Dutch samples

	Samples
Speaker Name	11290	3034	3798	4396	4429	880
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

German samples

	Samples
Speaker Name	1660	2252	3363	7120	7456	9494
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

Italian samples

	Samples
Speaker Name	1131	4009	428	6698	7372	7458
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

Polish samples

	Samples
Speaker Name	8758	9098	9860
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

Portuguese samples

	Samples
Speaker Name	11247	12287	12626	12670	3050	4405
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

DAPS test set

	Samples
Speaker Name	f10	f10	m10	m10	m10	m10
Codec
GT
1.89kbps Low Frame-rate Speech Codec
1.78kbps Ours
1.1kbps Mimi
1.1kbps Ours
1.1kbps Ours 25 FPS
1.1kbps Ours 6.25 FPS
0.9kbps WavTokenizer
0.85kbps TS3 Codec
0.8kbps Ours
0.7kbps TAAE
0.6kbps Ours

Low Frame-rate Speech Codec (LFSC) Streaming Experiments

We simulated streaming inference and evaluated the LFSC model with varying numbers of lookahead frames (tokens). Output perceptual quality was measured with PESQ; the results are shown in the table below. LFSC needs at least five lookahead frames to come close to its offline score, and even with seven lookahead frames it still falls slightly short of offline inference. In contrast, our models obtain the same PESQ score in streaming mode as in offline mode without any lookahead. We also provide audio samples below to illustrate how the number of lookahead frames affects the perceptual quality of the LFSC model.

Model - Condition	PESQ
1.78kbps Ours 12.5 FPS - Offline	2.955
1.78kbps Ours 12.5 FPS - Streaming no lookahead	2.955
1.89kbps Ours 21.5 FPS - Offline	3.087
1.89kbps Ours 21.5 FPS - Streaming no lookahead	3.087
1.89kbps LFSC 21.5 FPS - Offline	3.075
1.89kbps LFSC 21.5 FPS - Streaming 3 frames lookahead	2.912
1.89kbps LFSC 21.5 FPS - Streaming 4 frames lookahead	3.064
1.89kbps LFSC 21.5 FPS - Streaming 5 frames lookahead	3.071
1.89kbps LFSC 21.5 FPS - Streaming 6 frames lookahead	3.072
1.89kbps LFSC 21.5 FPS - Streaming 7 frames lookahead	3.074
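To relate the lookahead settings in the table to waiting time, the sketch below converts lookahead frames into the audio duration that must be buffered and reports the remaining PESQ gap to offline inference. The PESQ values are copied from the table. The delay model is an assumption (delay = lookahead frames × frame duration, ignoring compute time and any other buffering), and the names are hypothetical.

```python
# PESQ values copied from the LFSC rows of the table above.
LFSC_FPS = 21.5
LFSC_OFFLINE_PESQ = 3.075
LFSC_STREAMING_PESQ = {3: 2.912, 4: 3.064, 5: 3.071, 6: 3.072, 7: 3.074}


def lookahead_delay_ms(lookahead_frames: int, frame_rate_fps: float) -> float:
    """Audio that must be buffered before decoding a frame, in milliseconds.

    Assumption: delay equals lookahead frames times the frame duration.
    """
    return lookahead_frames / frame_rate_fps * 1000.0


def pesq_gap(lookahead_frames: int) -> float:
    """Offline PESQ minus streaming PESQ for a given lookahead (table values)."""
    return LFSC_OFFLINE_PESQ - LFSC_STREAMING_PESQ[lookahead_frames]


if __name__ == "__main__":
    for frames in sorted(LFSC_STREAMING_PESQ):
        delay = lookahead_delay_ms(frames, LFSC_FPS)
        gap = pesq_gap(frames)
        print(f"{frames} frames lookahead: {delay:6.1f} ms buffered, PESQ gap {gap:.3f}")
```

Under this simple delay model, five lookahead frames at 21.5 FPS correspond to roughly 233 ms of buffered audio, whereas a model that streams with no lookahead adds none.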

DAPS test set

	Samples
Speaker Name	f10	f10	m10	m10	m10	m10
Codec
GT
1.78kbps Ours 12.5 FPS - Offline
1.78kbps Ours 12.5 FPS - Streaming no lookahead
1.89kbps Ours 21.5 FPS - Offline

1.89kbps LFSC 21.5 FPS - Offline
1.89kbps LFSC 21.5 FPS - Streaming 3 frames lookahead
1.89kbps LFSC 21.5 FPS - Streaming 4 frames lookahead
1.89kbps LFSC 21.5 FPS - Streaming 5 frames lookahead
1.89kbps LFSC 21.5 FPS - Streaming 6 frames lookahead
1.89kbps LFSC 21.5 FPS - Streaming 7 frames lookahead

ZS-TTS Audio Samples

	Samples
Sample ID	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
Codec
Ground truth
Reference
LFSC 21.5 FPS 1.89 kbps
NanoCodec 21.5 FPS 1.89 kbps
NanoCodec 12.5 FPS 1.78 kbps
NanoCodec 12.5 FPS 1.1 kbps
NanoCodec 12.5 FPS 1.78 kbps 10s context
