XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Edresson Casanova, Kelly Davis, Eren Gölge, Iulian Gulea, Logan Hart, Aya Jafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, and Julian Weber
Abstract:
Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox have explored Multilingual ZS-TTS, they are limited to a few high- and medium-resource languages, which restricts their use in most low- and medium-resource languages. In this paper, we aim to alleviate this issue by proposing the XTTS system and making it publicly available. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.
System architecture:
Zero-shot Multi-Speaker TTS
Note that for this audio demo, all audio was resampled to 24 kHz to allow a fair comparison.
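As an illustration of what bringing clips to a common 24 kHz rate involves, the following is a minimal sketch in pure Python. It is not the tool used to prepare this demo, which is not specified here. The function name `resample_linear` is hypothetical, and the method (linear interpolation with no anti-aliasing filter) is a deliberate simplification; real pipelines would typically use a band-limited resampler from an audio library.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustrative only).

    Assumption: `samples` is a sequence of floats for a single channel.
    No low-pass filtering is applied, so downsampling may alias.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples or src_rate == dst_rate:
        return list(samples)

    # The output length keeps the clip duration, rounded to the nearest sample.
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1

    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # neighbouring source sample on the left
        hi = min(lo + 1, last)     # neighbour on the right, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One second of silence at 16 kHz becomes one second at 24 kHz.
clip_16k = [0.0] * 16000
clip_24k = resample_linear(clip_16k, 16000, 24000)
```

Resampling every system's output to the same rate means that differences heard between samples come from the models rather than from their native output sample rates.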
English samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
Tortoise | ||||||||||||||||||||
StyleTTS 2 | ||||||||||||||||||||
Mega-TTS 2 | ||||||||||||||||||||
HierSpeech++ | ||||||||||||||||||||
Original YourTTS | ||||||||||||||||||||
YourTTS LibriTTS (Exp 1.) | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Chinese samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
Mega-TTS 2 | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Portuguese samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
Original YourTTS | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Spanish samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
French samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
Original YourTTS | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
German samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Italian samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Polish samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Turkish samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Russian samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Dutch samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Czech samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Arabic samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Hungarian samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Korean samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Japanese samples
generated_wav | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | f10_script2 | f1_script1 | f2_script1 | f3_script1 | f4_script2 | f5_script1 | f6_script2 | f7_script1 | f8_script2 | f9_script2 | m10_script2 | m1_script1 | m2_script1 | m3_script1 | m4_script2 | m5_script2 | m6_script1 | m7_script1 | m8_script2 | m9_script2 |
Model Name | ||||||||||||||||||||
Speaker Reference | ||||||||||||||||||||
YourTTS XTTS (Exp 2.) | ||||||||||||||||||||
XTTS (Exp 3.) |
Speaker Adaptation
To show the potential of the XTTS model for adaptation to new speakers and recording conditions, we selected samples of approximately 10 min of speech from well-known or unique-style voices (e.g., whispering voices) in different languages. We chose 3 speakers of English, 3 speakers of French, 3 speakers of Portuguese, 1 speaker of Chinese, and 1 speaker of Arabic.
English samples
generated_wav | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | en_whisper_female | en_female | en_robo_coqui | fr_male1 | pt_male1 | ar_male | zh_male | fr_male2 | fr_male3 | pt_female | pt_male2 |
Model Name | |||||||||||
Speaker Reference | |||||||||||
XTTS Zero-shot | |||||||||||
XTTS Fine-tuning |
Chinese samples
generated_wav | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | en_whisper_female | en_female | en_robo_coqui | fr_male1 | pt_male1 | ar_male | zh_male | fr_male2 | fr_male3 | pt_female | pt_male2 |
Model Name | |||||||||||
Speaker Reference | |||||||||||
XTTS Zero-shot | |||||||||||
XTTS Fine-tuning |
Portuguese samples
generated_wav | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | en_whisper_female | en_female | en_robo_coqui | fr_male1 | pt_male1 | ar_male | zh_male | fr_male2 | fr_male3 | pt_female | pt_male2 |
Model Name | |||||||||||
Speaker Reference | |||||||||||
XTTS Zero-shot | |||||||||||
XTTS Fine-tuning |
French samples
generated_wav | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | en_whisper_female | en_female | en_robo_coqui | fr_male1 | pt_male1 | ar_male | zh_male | fr_male2 | fr_male3 | pt_female | pt_male2 |
Model Name | |||||||||||
Speaker Reference | |||||||||||
XTTS Zero-shot | |||||||||||
XTTS Fine-tuning |
Arabic samples
generated_wav | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Speaker Name | en_whisper_female | en_female | en_robo_coqui | fr_male1 | pt_male1 | ar_male | zh_male | fr_male2 | fr_male3 | pt_female | pt_male2 |
Model Name | |||||||||||
Speaker Reference | |||||||||||
XTTS Zero-shot | |||||||||||
XTTS Fine-tuning |
Citation
@inproceedings{casanova24_interspeech,
  title     = {XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model},
  author    = {Edresson Casanova and Kelly Davis and Eren Gölge and Görkem Göknar and Iulian Gulea and Logan Hart and Aya Aljafari and Joshua Meyer and Reuben Morais and Samuel Olayemi and Julian Weber},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4978--4982},
  doi       = {10.21437/Interspeech.2024-2016},
  issn      = {2958-1796},
}