YourTTS

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Zero-shot Multi-Speaker TTS

Speaker Adaptation

Zero-Shot Voice Conversion

Abstract:

System architecture:

English speakers to English Speakers

Portuguese Speakers to Portuguese Speakers

English Speakers to Portuguese Speakers

Portuguese Speakers to English Speakers

Citation

GitHub • Paper on arXiv • Colab demos

Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge and Moacir Antonelli Ponti

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieve state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve SOTA results in voice similarity with reasonable quality. This matters because it allows synthesis for speakers whose voice or recording characteristics differ substantially from those seen during training.
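As an illustration of how voice similarity between a reference utterance and a synthesized one can be scored, here is a minimal sketch. It assumes each utterance has already been mapped to a fixed-size speaker embedding by some speaker encoder, which is not shown. The function names and the toy embeddings are hypothetical. `speaker_consistency_loss` follows the cosine-similarity form of the speaker consistency loss (SCL) described in the YourTTS paper, which the "+ SCL" rows in the sample tables below refer to; it is not the authors' training code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_consistency_loss(
    reference_embs: Sequence[Sequence[float]],
    generated_embs: Sequence[Sequence[float]],
    alpha: float = 1.0,
) -> float:
    """Negative mean cosine similarity over a batch, scaled by alpha.

    Minimizing this value pushes the embedding of each generated
    utterance towards the embedding of its reference utterance.
    """
    if not reference_embs or len(reference_embs) != len(generated_embs):
        raise ValueError("batches must be non-empty and of equal size")
    sims = [
        cosine_similarity(ref, gen)
        for ref, gen in zip(reference_embs, generated_embs)
    ]
    return -alpha * sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings (hypothetical values, for illustration only).
    reference = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
    generated = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.0]]
    print("similarity of first pair:", cosine_similarity(reference[0], generated[0]))
    print("SCL-style loss:", speaker_consistency_loss(reference, generated))
```

A similarity close to 1 means the two embeddings point in nearly the same direction, which is read as the voices sounding alike to the speaker encoder. The loss is the negated batch mean, so lower values correspond to closer voices.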

Zero-shot Multi-Speaker TTS

Audio Samples for VCTK test speakers

Model	Unseen Speakers (test)
	p225	p234	p238	p245	p248	p261	p294	p302	p326	p335	p347
Emb reference
Exp.1
Exp.1 + SCL
Exp.2
Exp.2 + SCL
Exp.3
Exp.3 + SCL
Exp.4 + SCL

Audio Samples for LibriTTS test speakers

Model	Unseen Speakers (test)
	1089	1188	121	1284	1580	1995	2300	237	260	908
Emb reference
Ground truth
Exp.1
Exp.1 + SCL
Exp.2
Exp.2 + SCL
Exp.3
Exp.3 + SCL
Exp.4 + SCL

Audio Samples for MLS Portuguese test speakers

Model	Unseen Speakers (test)
	11995	12249	12287	12710	13069	3050	4367	5677	7925	9351
Emb reference
Exp.2
Exp.2 + SCL
Exp.3
Exp.3 + SCL
Exp.4 + SCL

Speaker Adaptation

Exp.4 + SCL

Mode	Unseen Common Voice Speakers
	English Male	English Female	Portuguese Male	Portuguese Female
Ground Truth
Zero-shot
Fine-Tuned

Zero-Shot Voice Conversion

Exp.4 + SCL

Each row of the table shows the voice of that row's speaker, generated from a reference utterance of the speaker in the corresponding column. Therefore, all samples in a row should sound similar.

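English speakers to English Speakers

Female to Female

Male to Male

Female to Male

Male to Female

Portuguese Speakers to Portuguese Speakers

Female to Female

Male to Male

Female to Male

Male to Female

English Speakers to Portuguese Speakers

Female to Female

Male to Male

Female to Male

Male to Female

Portuguese Speakers to English Speakers

Female to Female

Male to Male

Female to Male

Male to Female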

Citation

@ARTICLE{2021arXiv211202418C,
  author = {{Casanova}, Edresson and {Weber}, Julian and {Shulby}, Christopher and {Junior}, Arnaldo Candido and {G{\"o}lge}, Eren and {Antonelli Ponti}, Moacir},
  title = "{YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone}",
  journal = {arXiv e-prints},
  keywords = {Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing},
  year = 2021,
  month = dec,
  eid = {arXiv:2112.02418},
  pages = {arXiv:2112.02418},
  archivePrefix = {arXiv},
  eprint = {2112.02418},
  primaryClass = {cs.SD},
  adsurl = {https://ui.adsabs.harvard.edu/abs/2021arXiv211202418C},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}