M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis
Abstract
Conversational text-to-speech (TTS) aims to synthesize a reply whose prosody suits the conversation history. However, comprehensively modeling the conversation remains a challenge: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Moreover, textual features alone are insufficient, because acoustic features also carry a variety of prosodic information. We therefore propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the conversation history and to enhance prosodic expression. Specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results show that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody and naturalness in CMOS tests.
Contents
Model Architecture
Evaluation on different models
M0: speech reconstructed from the mel spectrogram.
M1: baseline FastSpeech2 model.
M2: baseline model with Tu.
M3: baseline model with Wu.
M4: baseline model with Tu and Tp.
M5: baseline model with Wu and Wp.
M6: baseline model with Tu and Wu.
M7: baseline model with Tu, Tp, Wu and Wp.
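The configurations M1 to M7 listed above can be written down as a small lookup table, which makes it easier to see which systems differ by which context feature when comparing samples. This is only an illustrative sketch: the labels Tu, Tp, Wu and Wp are copied verbatim from the list, while the names `ABLATIONS`, `models_with` and `added_features` are hypothetical helpers, not part of the released system. M0 is left out because it is a reconstruction rather than a model configuration.

```python
# Context features added on top of the FastSpeech2 baseline, as listed above.
# The labels are copied from the list; nothing beyond it is assumed.
ABLATIONS = {
    "M1": frozenset(),                          # baseline only
    "M2": frozenset({"Tu"}),
    "M3": frozenset({"Wu"}),
    "M4": frozenset({"Tu", "Tp"}),
    "M5": frozenset({"Wu", "Wp"}),
    "M6": frozenset({"Tu", "Wu"}),
    "M7": frozenset({"Tu", "Tp", "Wu", "Wp"}),
}


def models_with(feature):
    """Return the sorted IDs of all configurations that include `feature`."""
    return sorted(m for m, feats in ABLATIONS.items() if feature in feats)


def added_features(base, other):
    """Return the sorted features present in `other` but not in `base`."""
    return sorted(ABLATIONS[other] - ABLATIONS[base])


if __name__ == "__main__":
    print(models_with("Tp"))           # configurations that use Tp
    print(added_features("M6", "M7"))  # what M7 adds over M6
```

For example, `added_features("M6", "M7")` lists the features that separate the two systems, which is the comparison to keep in mind when listening to the M6 and M7 columns of each sample.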
Sample 1
history
A: a big one would be nice.
B: how about this one? it's our biggest -- sixteen' in diameter.
A: oh, yes, i like that one, but it's too heavy.
B: ok, try this one. it's made of aluminum.
A: oh, yes! this is much better. but it has an aluminum handle.
M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
---|---|---|---|---|---|---|---|
Sample 2
history
B: umm i think we should have a house cleaning today. what's your opinion?
A: oh, no. we just did it last week.
B: come on. what do you want to do? washing clothes or cleaning the house?
A: i'd rather wash the clothes.
B: ok. here is the laundry.
M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
---|---|---|---|---|---|---|---|
Sample 3
history
B: come on. what do you want to do? washing clothes or cleaning the house?
A: i'd rather wash the clothes.
B: ok. here is the laundry.
A: oh, my god! so much!
B: don't worry. i'll help you with it later.
M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
---|---|---|---|---|---|---|---|
Sample 4
history
A: hi, dear, i've got your voice at last.
B: hi, darling, i am thinking it is the time of your calling.
A: are you ok today?
B: quite good except thinking of you so much.
M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
---|---|---|---|---|---|---|---|
Sample 5
history
B: what kind of drink do you want with those meals?
A: one coca cola and the other a sprite, please.
B: you can super-size your meals for only three dollars extra.
A: yes, i'd like that, then.
B: how about anything for dessert, like an apple pie or ice cream?
M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
---|---|---|---|---|---|---|---|
Sample 6
history
B: may i help you?
A: this dress is beautiful. may i try it on?
B: i'm afraid you can't.
A: what is the material of this dress?
B: it's one hundred percent cotton.
M0 | M1 | M2 | M3 | M4 | M5 | M6 | M7 |
---|---|---|---|---|---|---|---|