M²-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Jinlong Xue¹, Yayue Deng¹, Fengping Wang¹, Ya Li¹, Yingming Gao¹, Jianhua Tao², Jianqing Sun³, Jiaen Liang³
¹ Beijing University of Posts and Telecommunications, Beijing, China ² NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China ³ Unisound AI Technology Co., Ltd, Beijing, China

Abstract

Conversational text-to-speech (TTS) aims to synthesize a reply with prosody appropriate to the preceding conversation. However, modeling the conversation comprehensively remains a challenge: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Moreover, textual features alone are insufficient, because acoustic features also carry prosodic information. We therefore propose M²-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the historical conversation and enhance prosodic expression. Specifically, we design a textual context module and an acoustic context module, each with both coarse-grained and fine-grained modeling. Experimental results show that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody and naturalness in CMOS tests.

Model Architecture
Evaluation on different models

Model Architecture

Fig. 1: Overview of our proposed M²-CTTS model. M²-CTTS is based on FastSpeech2 and HiFi-GAN. We design a Textual Context Module (TCM) and an Acoustic Context Module (ACM) to model the conversation history. TCM contains a Text Utterance-Level Module (TUM) and a Text Phoneme-Level Module (TPM); ACM contains a Wave Utterance-Level Module (WUM) and a Wave Phoneme-Level Module (WPM). In addition, we use a Conditional Decoder and a Prosody Predictor Module (PPM) to enhance the expressiveness of the generated speech.

Evaluation on different models

M0: reconstructed from mel spectrogram.
M1: baseline FastSpeech2 model.
M2: baseline model with Tu.
M3: baseline model with Wu.
M4: baseline model with Tu and Tp.
M5: baseline model with Wu and Wp.
M6: baseline model with Tu and Wu.
M7: baseline model with Tu, Tp, Wu and Wp.
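
The list above can be read as an ablation grid over four context sub-modules. The following is a minimal sketch of that grid, not the authors' code. It assumes that Tu, Tp, Wu and Wp refer to the outputs of TUM, TPM, WUM and WPM respectively, which the page suggests by naming but does not state. The names `ContextConfig` and `SYSTEMS` are hypothetical and introduced only for illustration. M0 is left out because it is a reconstruction from the mel spectrogram rather than a synthesis system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ContextConfig:
    """Hypothetical switchboard: which context sub-modules one system uses."""

    tu: bool = False  # text utterance-level context (assumed to be TUM)
    tp: bool = False  # text phoneme-level context (assumed to be TPM)
    wu: bool = False  # wave utterance-level context (assumed to be WUM)
    wp: bool = False  # wave phoneme-level context (assumed to be WPM)

    def enabled(self) -> list[str]:
        """Return the names of the switched-on sub-modules, in a fixed order."""
        return [name for name in ("tu", "tp", "wu", "wp") if getattr(self, name)]


# M1 to M7 as listed on this page; M1 is the plain FastSpeech2 baseline.
SYSTEMS = {
    "M1": ContextConfig(),
    "M2": ContextConfig(tu=True),
    "M3": ContextConfig(wu=True),
    "M4": ContextConfig(tu=True, tp=True),
    "M5": ContextConfig(wu=True, wp=True),
    "M6": ContextConfig(tu=True, wu=True),
    "M7": ContextConfig(tu=True, tp=True, wu=True, wp=True),
}

if __name__ == "__main__":
    for name, cfg in SYSTEMS.items():
        print(name, cfg.enabled() or ["baseline only"])
```

Read this way, M2 versus M4 and M3 versus M5 isolate the fine-grained (phoneme-level) context within one modality, M6 combines the two coarse-grained contexts, and M7 enables everything.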

Sample 1

history

A: a big one would be nice.
B: how about this one? it's our biggest -- sixteen' in diameter.
A: oh, yes, i like that one, but it's too heavy.
B: ok, try this one. it's made of aluminum.
A: oh, yes! this is much better. but it has an aluminum handle.

M0	M1	M2	M3	M4	M5	M6	M7

Sample 2

history

B: umm i think we should have a house cleaning today. what's your opinion?
A: oh, no. we just did it last week.
B: come on. what do you want to do? washing clothes or cleaning the house?
A: i'd rather wash the clothes.
B: ok. here is the laundry.

M0	M1	M2	M3	M4	M5	M6	M7

Sample 3

history

B: come on. what do you want to do? washing clothes or cleaning the house?
A: i'd rather wash the clothes.
B: ok. here is the laundry.
A: oh, my god! so much!
B: don't worry. i'll help you with it later.

M0	M1	M2	M3	M4	M5	M6	M7

Sample 4

history

A: hi, dear, i've got your voice at last.
B: hi, darling, i am thinking it is the time of your calling.
A: are you ok today?
B: quite good except thinking of you so much.

M0	M1	M2	M3	M4	M5	M6	M7

Sample 5

history

B: what kind of drink do you want with those meals?
A: one coca cola and the other a sprite, please.
B: you can super-size your meals for only three dollars extra.
A: yes, i'd like that, then.
B: how about anything for dessert, like an apple pie or ice cream?

M0	M1	M2	M3	M4	M5	M6	M7

Sample 6

history

B: may i help you?
A: this dress is beautiful. may i try it on?
B: i'm afraid you can't.
A: what is the material of this dress?
B: it's one hundred percent cotton.

M0	M1	M2	M3	M4	M5	M6	M7
