GitHub • Paper on arXiv • Hugging Face demo


XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, and Julian Weber

Abstract:

Most zero-shot multi-speaker TTS (ZS-TTS) systems support only a single language. Although models such as YourTTS, VALL-E X, Mega-TTS 2, and Voicebox have explored multilingual ZS-TTS, they are limited to a few high- and medium-resource languages, which restricts their use in most low- and medium-resource languages. In this paper, we aim to alleviate this issue by proposing the XTTS system and making it publicly available. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained on 16 languages and achieved state-of-the-art (SOTA) results in most of them.

System architecture:

Zero-shot Multi-Speaker TTS

Note that for this audio demo, all audio was resampled to 24 kHz to allow a fair comparison.
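As an illustration of what resampling to a common 24 kHz rate involves, the sketch below computes the integer up/down factors for a given source rate and applies a naive linear-interpolation resampler. This is only a minimal sketch: the page does not say which tool was used for the demo audio, and the names `resample_factors` and `resample_linear` are hypothetical helpers introduced here. Linear interpolation has no anti-aliasing filter, so a real pipeline would more likely pass the same factors to a band-limited resampler such as `scipy.signal.resample_poly(x, up, down)`.

```python
from fractions import Fraction
import math

TARGET_RATE = 24_000  # Hz, the common rate stated in the note above


def resample_factors(src_rate: int, dst_rate: int = TARGET_RATE) -> tuple[int, int]:
    """Return the smallest integers (up, down) with dst_rate == src_rate * up / down."""
    ratio = Fraction(dst_rate, src_rate)  # Fraction reduces to lowest terms
    return ratio.numerator, ratio.denominator


def resample_linear(samples: list[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> list[float]:
    """Naive linear-interpolation resampler (illustrative; no anti-aliasing)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    up, down = resample_factors(src_rate, dst_rate)
    n_out = math.ceil(len(samples) * up / down)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * down / up        # fractional position in the source signal
        lo = min(int(pos), last)   # sample at or before that position
        hi = min(lo + 1, last)     # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    for rate in (16_000, 22_050, 44_100, 48_000):
        print(rate, "->", TARGET_RATE, "factors:", resample_factors(rate))
```

Working with reduced integer factors keeps the rate conversion exact, so systems that output audio at different native rates all end up with the same number of samples per second before they are compared.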

English samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
Tortoise
StyleTTS 2
Mega-TTS 2
HierSpeech++
Original YourTTS
YourTTS LibriTTS (Exp 1.)
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Chinese samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
Mega-TTS 2
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Portuguese samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
Original YourTTS
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Spanish samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

French samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
Original YourTTS
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

German samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Italian samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Polish samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Turkish samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Russian samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Dutch samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Czech samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Arabic samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Hungarian samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Korean samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Japanese samples

Speaker Name f10_script2 f1_script1 f2_script1 f3_script1 f4_script2 f5_script1 f6_script2 f7_script1 f8_script2 f9_script2 m10_script2 m1_script1 m2_script1 m3_script1 m4_script2 m5_script2 m6_script1 m7_script1 m8_script2 m9_script2
Model Name
Speaker Reference
YourTTS XTTS (Exp 2.)
XTTS (Exp 3.)

Speaker Adaptation

To show the potential of the XTTS model for adaptation to new speakers and recording conditions, we selected samples of approximately 10 minutes of speech from well-known or unique-style voices (e.g., whispering voices) in different languages. We chose 3 English speakers, 3 Portuguese speakers, 3 French speakers, 1 Chinese speaker, and 1 Arabic speaker.

English samples

Speaker Name en_whisper_female en_female en_robo_coqui fr_male1 pt_male1 ar_male zh_male fr_male2 fr_male3 pt_female pt_male2
Model Name
Speaker Reference
XTTS Zero-shot
XTTS Fine-tuning

Chinese samples

Speaker Name en_whisper_female en_female en_robo_coqui fr_male1 pt_male1 ar_male zh_male fr_male2 fr_male3 pt_female pt_male2
Model Name
Speaker Reference
XTTS Zero-shot
XTTS Fine-tuning

Portuguese samples

Speaker Name en_whisper_female en_female en_robo_coqui fr_male1 pt_male1 ar_male zh_male fr_male2 fr_male3 pt_female pt_male2
Model Name
Speaker Reference
XTTS Zero-shot
XTTS Fine-tuning

French samples

Speaker Name en_whisper_female en_female en_robo_coqui fr_male1 pt_male1 ar_male zh_male fr_male2 fr_male3 pt_female pt_male2
Model Name
Speaker Reference
XTTS Zero-shot
XTTS Fine-tuning

Arabic samples

Speaker Name en_whisper_female en_female en_robo_coqui fr_male1 pt_male1 ar_male zh_male fr_male2 fr_male3 pt_female pt_male2
Model Name
Speaker Reference
XTTS Zero-shot
XTTS Fine-tuning

Citation

@inproceedings{casanova24_interspeech,
  title     = {XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model},
  author    = {Edresson Casanova and Kelly Davis and Eren Gölge and Görkem Göknar and Iulian Gulea and Logan Hart and Aya Aljafari and Joshua Meyer and Reuben Morais and Samuel Olayemi and Julian Weber},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4978--4982},
  doi       = {10.21437/Interspeech.2024-2016},
  issn      = {2958-1796},
}