Continual Speaker Adaptation for Text-to-Speech Synthesis


* All audio files are reconstructed with the same vocoder (including the dataset samples).

Comparison of Different Methods

Illustration

The results from all episode-based methods (gray rows) are synthesized in the last episode.

-> Dataset: VCTK


Method           | Random Speaker (p226) | Random Speaker (p316) | Random Speaker (p284)
Dataset Sample   | [audio]               | [audio]               | [audio]
JT (Upper Bound) | [audio]               | [audio]               | [audio]
SA (Baseline)    | [audio]               | [audio]               | [audio]
EWC              | [audio]               | [audio]               | [audio]
ER (BS:10)       | [audio]               | [audio]               | [audio]
ER-KD (BS:10)    | [audio]               | [audio]               | [audio]

-> Dataset: CVDE


Method           | Random Speaker (13) | Random Speaker (51)* | Random Speaker (19)
Dataset Sample   | [audio]             | [audio]              | [audio]
JT (Upper Bound) | [audio]             | [audio]              | [audio]
SA (Baseline)    | [audio]             | [audio]              | [audio]
EWC              | [audio]             | [audio]              | [audio]
ER (BS:10)       | [audio]             | [audio]              | [audio]
ER-KD (BS:10)    | [audio]             | [audio]              | [audio]

* Speaker 51 is the only female speaker in the dataset.



Experience Replay - Buffer Replicate (ER-BR)

ER-BR mainly improves the pace of the synthesized speech, and in some cases also the audio quality.
However, it may still fail for difficult speakers (in the examples: the female speaker #51), since the dataset contains no other speaker with similar vocal characteristics.
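To make the comparison between ER and ER-BR concrete, here is a minimal sketch of an experience-replay buffer for continual speaker adaptation. The class name, the per-speaker buffer size, and the idea of realizing "buffer replicate" by repeating stored past-speaker items in the training pool are all assumptions for illustration; the page itself does not specify the exact replication scheme.

```python
import random


class ReplayBuffer:
    """Sketch of an experience-replay (ER) buffer for continual TTS training.

    Stores up to `size_per_speaker` (text, audio) pairs per past speaker.
    `sample` draws a mini-batch of stored items; passing `replicate > 1`
    repeats each stored item, which is one plausible reading of the
    "buffer replicate" (ER-BR) variant -- an assumption, not the paper's
    exact method.
    """

    def __init__(self, size_per_speaker=1):
        self.size = size_per_speaker
        self.buffer = {}  # speaker_id -> list of (text, audio) pairs

    def add(self, speaker_id, item):
        # Keep at most `size_per_speaker` items per past speaker (BS:1 keeps one).
        items = self.buffer.setdefault(speaker_id, [])
        if len(items) < self.size:
            items.append(item)

    def sample(self, n, replicate=1):
        # Flatten stored items across speakers, optionally replicating each one,
        # then draw up to n of them to mix into the current speaker's batch.
        pool = [it for items in self.buffer.values() for it in items] * replicate
        if not pool:
            return []
        return random.sample(pool, min(n, len(pool)))
```

In each adaptation episode, the replayed items would be concatenated with the current speaker's batch, so past speakers keep contributing gradient signal and are not forgotten.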

Speaker: 01

Dataset Sample: [audio]

Method       | Sentence 1 | Sentence 2
SA           | [audio]    | [audio]
ER (BS:1)    | [audio]    | [audio]
ER-BR (BS:1) | [audio]    | [audio]



Speaker: 32

Dataset Sample: [audio]

Method       | Sentence 1 | Sentence 2
SA           | [audio]    | [audio]
ER (BS:1)    | [audio]    | [audio]
ER-BR (BS:1) | [audio]    | [audio]



Speaker: 51

* In this example ER-BR also fails to synthesize speech for the previous speaker and only partially recovers the vocal characteristics.

Dataset Sample: [audio]

Method       | Sentence 1 | Sentence 2
SA           | [audio]    | [audio]
ER (BS:1)    | [audio]    | [audio]
ER-BR (BS:1) | [audio]    | [audio]