ENRICH

Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions

Dipjyoti Paul, Yannis Pantazis and Yannis Stylianou


Abstract: Recent advancements in deep learning have led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when generalizing those systems to multi-speaker models, especially for unseen speakers and unseen recording conditions. For instance, conventional neural vocoders are adjusted to the training speaker and have poor generalization capability to unseen speakers. In this work, we propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN), and target the development of an efficient universal vocoder even for unseen speakers and recording conditions. In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings. Trained on publicly available data, SC-WaveRNN achieves significantly better performance than the baseline WaveRNN on both subjective and objective metrics. In terms of MOS, SC-WaveRNN yields an improvement of about 23% for seen speakers and seen recording conditions, and up to 95% for unseen speakers and unseen conditions. Finally, we extend this work by implementing multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation. In preference tests, our system is chosen over the baseline TTS system by 60% vs. 15.5% for seen speakers and by 60.9% vs. 32.6% for unseen speakers.
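Below is a minimal PyTorch sketch (not the authors' code) of the core idea behind speaker conditioning of a WaveRNN-style vocoder: a fixed-dimensional speaker embedding is broadcast over time and concatenated with the upsampled acoustic conditioning features before the sample-level recurrent network, so the vocoder can adapt its output to the target voice. Module names, dimensions, and the single-GRU structure are illustrative assumptions, not the published architecture.

```python
# Illustrative sketch of speaker-conditional WaveRNN-style vocoding.
# Assumptions: 80-dim mels already upsampled to sample rate, 256-dim speaker
# embedding (e.g. from a d-vector model), 8-bit mu-law output classes.

import torch
import torch.nn as nn


class SCWaveRNNSketch(nn.Module):
    def __init__(self, n_mels=80, spk_dim=256, cond_dim=128, rnn_dim=512, n_classes=256):
        super().__init__()
        # Mel frames + broadcast speaker embedding -> per-sample conditioning vector
        self.cond = nn.Linear(n_mels + spk_dim, cond_dim)
        # Sample-level autoregressive RNN driven by the previous sample + conditioning
        self.rnn = nn.GRU(1 + cond_dim, rnn_dim, batch_first=True)
        self.out = nn.Sequential(
            nn.Linear(rnn_dim, rnn_dim), nn.ReLU(),
            nn.Linear(rnn_dim, n_classes)  # logits over quantized waveform values
        )

    def forward(self, prev_samples, mels, spk_emb):
        # prev_samples: (B, T, 1)      previous waveform samples (teacher forcing)
        # mels:         (B, T, n_mels) acoustic features at sample rate
        # spk_emb:      (B, spk_dim)   one embedding per utterance
        spk = spk_emb.unsqueeze(1).expand(-1, mels.size(1), -1)   # broadcast over time
        c = torch.tanh(self.cond(torch.cat([mels, spk], dim=-1)))
        h, _ = self.rnn(torch.cat([prev_samples, c], dim=-1))
        return self.out(h)                                        # (B, T, n_classes)


# Toy usage: batch of 2 utterances, 100 samples each
model = SCWaveRNNSketch()
logits = model(torch.randn(2, 100, 1), torch.randn(2, 100, 80), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 100, 256])
```

At synthesis time the same embedding, extracted from a short reference utterance of the target speaker, is held fixed over the whole waveform; this is what allows the vocoder (and the zero-shot TTS extension) to adapt to speakers never seen during training.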


Audio Samples:

Universal vocoder:

Seen speakers and seen recording conditions:

Original | WaveRNN | SC-WaveRNN

Unseen speakers and seen recording conditions:

Original | WaveRNN | SC-WaveRNN

Unseen speakers and unseen recording conditions:

Original | WaveRNN | SC-WaveRNN





Zero-shot TTS:

Seen speakers:

Reference | Baseline TTS [1] | Proposed TTS

Unseen speakers:

Reference | Baseline TTS [1] | Proposed TTS

Reference:

[1] "Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings" Erica Cooper, Cheng I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi in ICASSP 2020, pp. 6184-6188.

*ENRICH has received funding from the EU H2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 675324.