M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Jinlong Xue1, Yayue Deng1, Fengping Wang1, Ya Li1, Yingming Gao1, Jianhua Tao2, Jianqing Sun3, Jiaen Liang3

1 Beijing University of Posts and Telecommunications, Beijing, China
2 NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
3 Unisound AI Technology Co., Ltd, Beijing, China

Abstract

Conversational text-to-speech (TTS) aims to synthesize a reply whose prosody suits the preceding conversation. Modeling the conversation comprehensively remains a challenge, however: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Textual features alone are also insufficient, since acoustic features carry prosodic information of their own. We therefore propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the historical conversation and to enhance prosodic expression. Specifically, we design a textual context module and an acoustic context module, each with both coarse-grained and fine-grained modeling. Experimental results show that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody and naturalness in CMOS tests.

Model Architecture


Fig. 1: Overview of the proposed M2-CTTS model, which is built on FastSpeech2 and HiFi-GAN. A Textual Context Module (TCM) and an Acoustic Context Module (ACM) model the conversation history. The TCM contains a Text Utterance-Level Module (TUM) and a Text Phoneme-Level Module (TPM); the ACM contains a Wave Utterance-Level Module (WUM) and a Wave Phoneme-Level Module (WPM). In addition, a Conditional Decoder and a Prosody Predictor Module (PPM) enhance the expressiveness of the generated speech.
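To make the coarse-grained versus fine-grained distinction concrete, the following is a minimal sketch and not the authors' implementation. It assumes that utterance-level context is a single vector pooled over the history, that phoneme-level context comes from attention over history token features, and that both are simply added to the current phoneme encodings. The mean pooling, the dot-product attention, the additive fusion, and every function name (`mean_pool`, `attend`, `add_context`) are illustrative assumptions; the page does not specify how the modules are built.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Scaled dot-product attention of one query over history vectors.

    The history vectors serve as both keys and values (an assumption).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys))
            for i in range(len(query))]

def add_context(phonemes, history_utts, history_tokens):
    """Add coarse (utterance-level) and fine (phoneme-level) context.

    phonemes:       encodings of the current utterance, one per phoneme
    history_utts:   one vector per history utterance (hypothetical input)
    history_tokens: one vector per history token (hypothetical input)
    """
    coarse = mean_pool(history_utts)          # same vector for every phoneme
    out = []
    for p in phonemes:
        fine = attend(p, history_tokens)      # differs per phoneme
        out.append([a + b + c for a, b, c in zip(p, coarse, fine)])
    return out
```

The point of the sketch is only the difference in granularity: the coarse vector is identical for every phoneme of the reply, while the fine vector is recomputed per phoneme, which is what would let local cues such as emphasis in the history affect individual positions.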


Evaluation on different models

M0: speech reconstructed from the mel spectrogram.
M1: baseline FastSpeech2 model.
M2: baseline model with Tu (text utterance-level context).
M3: baseline model with Wu (wave utterance-level context).
M4: baseline model with Tu and Tp (text phoneme-level context).
M5: baseline model with Wu and Wp (wave phoneme-level context).
M6: baseline model with Tu and Wu.
M7: baseline model with Tu, Tp, Wu, and Wp.
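The list above can be read as an ablation grid over two axes: modality (text or wave) and scale (utterance or phoneme). The sketch below transcribes it into a lookup table so the systems can be grouped when listening to the samples. The names `SYSTEMS`, `uses_acoustic`, and `uses_fine_grained` are hypothetical helpers introduced here, not part of any released code, and M0 is left out because it is a reconstruction rather than a synthesis model.

```python
# Context features enabled in each system, transcribed from the list above.
# M1 is the plain FastSpeech2 baseline, so it enables none of them.
SYSTEMS = {
    "M1": set(),
    "M2": {"Tu"},
    "M3": {"Wu"},
    "M4": {"Tu", "Tp"},
    "M5": {"Wu", "Wp"},
    "M6": {"Tu", "Wu"},
    "M7": {"Tu", "Tp", "Wu", "Wp"},
}

def uses_acoustic(system):
    """True if the system uses any wave (W*) context feature."""
    return any(f.startswith("W") for f in SYSTEMS[system])

def uses_fine_grained(system):
    """True if the system uses any phoneme-level (*p) context feature."""
    return any(f.endswith("p") for f in SYSTEMS[system])

# Systems that combine both modalities.
multi_modal = sorted(
    name for name, feats in SYSTEMS.items()
    if any(f.startswith("T") for f in feats) and uses_acoustic(name)
)
```

Grouped this way, M2 versus M4 and M3 versus M5 isolate the effect of adding phoneme-level context within one modality, while M6 versus M7 isolates it in the multi-modal setting.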

Sample 1

history

A: a big one would be nice.
B: how about this one? it's our biggest -- sixteen' in diameter.
A: oh, yes, i like that one, but it's too heavy.
B: ok, try this one. it's made of aluminum.
A: oh, yes! this is much better. but it has an aluminum handle.

M0 M1 M2 M3 M4 M5 M6 M7


Sample 2

history

B: umm i think we should have a house cleaning today. what's your opinion?
A: oh, no. we just did it last week.
B: come on. what do you want to do? washing clothes or cleaning the house?
A: i'd rather wash the clothes.
B: ok. here is the laundry.

M0 M1 M2 M3 M4 M5 M6 M7


Sample 3

history

B: come on. what do you want to do? washing clothes or cleaning the house?
A: i'd rather wash the clothes.
B: ok. here is the laundry.
A: oh, my god! so much!
B: don't worry. i'll help you with it later.

M0 M1 M2 M3 M4 M5 M6 M7


Sample 4

history

A: hi, dear, i've got your voice at last.
B: hi, darling, i am thinking it is the time of your calling.
A: are you ok today?
B: quite good except thinking of you so much.

M0 M1 M2 M3 M4 M5 M6 M7


Sample 5

history

B: what kind of drink do you want with those meals?
A: one coca cola and the other a sprite, please.
B: you can super-size your meals for only three dollars extra.
A: yes, i'd like that, then.
B: how about anything for dessert, like an apple pie or ice cream?

M0 M1 M2 M3 M4 M5 M6 M7


Sample 6

history

B: may i help you?
A: this dress is beautiful. may i try it on?
B: i'm afraid you can't.
A: what is the material of this dress?
B: it's one hundred percent cotton.

M0 M1 M2 M3 M4 M5 M6 M7