Continual Speaker Adaptation for Text-to-Speech Synthesis


* All audio files are reconstructed with the same vocoder (including the dataset samples).

Comparison of Different Methods

Illustration

The results from all episode-based methods (gray rows) are synthesized in the last episode.

-> Dataset: VCTK


Method           | Random Speaker (p226) | Random Speaker (p316) | Random Speaker (p284)
Dataset Sample   | [audio]               | [audio]               | [audio]
JT (Upper Bound) | [audio]               | [audio]               | [audio]
SA (Baseline)    | [audio]               | [audio]               | [audio]
EWC              | [audio]               | [audio]               | [audio]
ER (BS:10)       | [audio]               | [audio]               | [audio]
ER-KD (BS:10)    | [audio]               | [audio]               | [audio]

-> Dataset: CVDE


Method           | Random Speaker (13) | Random Speaker (51)* | Random Speaker (19)
Dataset Sample   | [audio]             | [audio]              | [audio]
JT (Upper Bound) | [audio]             | [audio]              | [audio]
SA (Baseline)    | [audio]             | [audio]              | [audio]
EWC              | [audio]             | [audio]              | [audio]
ER (BS:10)       | [audio]             | [audio]              | [audio]
ER-KD (BS:10)    | [audio]             | [audio]              | [audio]

* Speaker 51 is the only female speaker in the dataset.



Experience Replay - Buffer Replicate (ER-BR)

ER-BR mainly improves the pace of the synthesized speech, and in some cases also the audio quality.
However, it may still fail for difficult speakers (in the examples: the female speaker #51), since the dataset contains no other speaker with similar vocal characteristics.
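To make the comparison between ER and ER-BR concrete, here is a minimal sketch of an experience-replay buffer for continual speaker adaptation. The class name, the per-speaker buffer size, and the idea of realizing "buffer replicate" by repeating stored past-speaker items in the training pool are all assumptions for illustration; the page itself does not specify the exact replication scheme.

```python
import random


class ReplayBuffer:
    """Sketch of an experience-replay (ER) buffer for continual TTS training.

    Stores up to `size_per_speaker` (text, audio) pairs per past speaker.
    `sample` draws a mini-batch of stored items; passing `replicate > 1`
    repeats each stored item, which is one plausible reading of the
    "buffer replicate" (ER-BR) variant -- an assumption, not the paper's
    exact method.
    """

    def __init__(self, size_per_speaker=1):
        self.size = size_per_speaker
        self.buffer = {}  # speaker_id -> list of (text, audio) pairs

    def add(self, speaker_id, item):
        # Keep at most `size_per_speaker` items per past speaker (BS:1 keeps one).
        items = self.buffer.setdefault(speaker_id, [])
        if len(items) < self.size:
            items.append(item)

    def sample(self, n, replicate=1):
        # Flatten stored items across speakers, optionally replicating each one,
        # then draw up to n of them to mix into the current speaker's batch.
        pool = [it for items in self.buffer.values() for it in items] * replicate
        if not pool:
            return []
        return random.sample(pool, min(n, len(pool)))
```

In each adaptation episode, the replayed items would be concatenated with the current speaker's batch, so past speakers keep contributing gradient signal and are not forgotten.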

Speaker: 01

Dataset Sample: [audio]

Method       | Sentence 1 | Sentence 2
SA           | [audio]    | [audio]
ER (BS:1)    | [audio]    | [audio]
ER-BR (BS:1) | [audio]    | [audio]



Speaker: 32

Dataset Sample: [audio]

Method       | Sentence 1 | Sentence 2
SA           | [audio]    | [audio]
ER (BS:1)    | [audio]    | [audio]
ER-BR (BS:1) | [audio]    | [audio]



Speaker: 51

* In this example ER-BR also fails to synthesize speech for the previous speaker and only partially recovers the vocal characteristics.

Dataset Sample: [audio]

Method       | Sentence 1 | Sentence 2
SA           | [audio]    | [audio]
ER (BS:1)    | [audio]    | [audio]
ER-BR (BS:1) | [audio]    | [audio]