ENRICH

A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning based on Rényi Divergence Minimization

Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis and Yannis Stylianou


Abstract :- In this paper, we present a universal multi-speaker, multi-style Text-to-Speech (TTS) synthesis system that generates speech from text with speaker characteristics and speaking style similar to a given reference signal. The system is trained on non-parallel data and generates voices in an unsupervised manner, i.e., neither style annotations nor speaker labels are required. To avoid leaking content information into the style embeddings (referred to as "content leakage") and leaking speaker information into the style embeddings (referred to as "style leakage"), we propose a novel Rényi Divergence based Disentangled Representation framework trained through adversarial learning. Similar to mutual information minimization, the proposed approach explicitly estimates, via a variational formula, and then minimizes the Rényi divergence between the joint distribution and the product of marginals for the content-style and style-speaker pairs. By doing so, the content, style and speaker spaces become representative and (ideally) independent of each other. Our proposed system greatly reduces content leakage, improving the word error rate by approximately 17-19% relative to the baseline system. In MOS-speech-quality, the proposed algorithm achieves an improvement of about 16-20%, whereas MOS-style-similarity improves by about 15% in relative terms.
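To make the quantity being minimized concrete, the following is a minimal sketch, not the system's training code. It assumes a toy discrete joint distribution over two variables (standing in for, say, a content code and a style code) and computes the Rényi divergence between that joint and the product of its marginals directly from the standard definition, rather than through the variational estimate the abstract refers to. The function names, the 2x2 tables and the choice alpha = 0.5 are all illustrative assumptions.

```python
import math


def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(p || q) for discrete distributions.

    Uses the standard definition
        D_alpha = log(sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha - 1)
    and assumes alpha > 0, alpha != 1, and q_i > 0 wherever p_i > 0.
    """
    total = sum(pi ** alpha * qi ** (1.0 - alpha)
                for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1.0)


def renyi_dependence(joint, alpha=0.5):
    """Divergence between a joint table and the product of its marginals.

    `joint` is a list of rows whose entries sum to 1 (hypothetical toy data).
    The result is zero when the two variables are independent.
    """
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    p = [value for row in joint for value in row]
    q = [r * c for r in row_marginal for c in col_marginal]
    return renyi_divergence(p, q, alpha)


# Independent variables: the joint equals the product of marginals.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Strongly coupled variables: most mass lies on the diagonal.
dependent = [[0.45, 0.05],
             [0.05, 0.45]]

print("independent:", renyi_dependence(independent))
print("dependent:  ", renyi_dependence(dependent))
```

The divergence vanishes for the independent table and is strictly positive for the coupled one, which is why driving it towards zero for the content-style and style-speaker pairs encourages the embeddings to become independent.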


Audio Samples:

Text :- Many farmers cannot even agree within their own families.

Reference Speaker | Reference Style
UTTS | UTTS MINE
UTTS S-RDDR | UTTS H-RDDR

Text :- In his absence, the council adopted the change.

Reference Speaker | Reference Style
UTTS | UTTS MINE
UTTS S-RDDR | UTTS H-RDDR

Text :- There is, according to legend, a boiling pot of gold at one end.

Reference Speaker | Reference Style
UTTS | UTTS MINE
UTTS S-RDDR | UTTS H-RDDR

Text :- Ask her to bring these things with her from the store.

Reference Speaker | Reference Style
UTTS | UTTS MINE
UTTS S-RDDR | UTTS H-RDDR

Text :- That, however, can only be achieved by constant investment.

Reference Speaker | Reference Style
UTTS | UTTS MINE
UTTS S-RDDR | UTTS H-RDDR


*ENRICH has received funding from the EU H2020 research and innovation programme under the MSCA GA 675324