Jinlong Xue

Hi there! I am currently a master’s student at the Beijing University of Posts and Telecommunications (BUPT), majoring in Artificial Intelligence. Supervised by Associate Professor Ya Li, I focus on speech synthesis, NLP, and multimodal generation.

I have a keen interest in all aspects of multimodal generation, including speech synthesis, multimodal LLMs, and AIGC. I am also interested in developing intelligent, interactive AI systems with human-like emotions.

I plan to pursue a Ph.D. abroad after completing my master’s degree in 2025. If any professors or researchers are interested in my work or see potential for collaboration, please do not hesitate to contact me!

news

Oct 25, 2024	Our Text-to-Audio model Auffusion has been accepted to TASLP! 🎉
Jun 05, 2024	Two papers have been accepted to Interspeech 2024! Demos: MMCE-Qformer-TTS and RAG-TTS 🎉
Jan 02, 2024	The paper, code, and project for our Text-to-Audio model Auffusion are released! 🎉
Dec 14, 2023	Our ICASSP 2024 paper CONCSS is accepted! 🎉
Jul 30, 2023	Our ACM MM 2023 paper CMCU-CSS is accepted! 🎉

selected publications

2024

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Jinlong Xue, Yayue Deng, Yicheng Han, and 2 more authors

Interspeech, 2024

Recent advances in large language models (LLMs) and the development of audio codecs have greatly propelled zero-shot TTS. Such models can synthesize personalized speech with only a 3-second speech sample from an unseen speaker as the acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. In addition, we adapt a pretrained LLM to leverage its understanding ability to predict semantic tokens, and use SoundStorm to generate acoustic tokens, thereby enhancing audio quality and speaker similarity. Extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining

Jinlong Xue, Yayue Deng, Yingming Gao, and 1 more author

Interspeech, 2024

Recent prompt-based text-to-speech (TTS) models can clone an unseen speaker using only a short speech prompt. They leverage a strong in-context ability to mimic the speech prompts, including speaker style, prosody, and emotion. Therefore, the selection of a speech prompt greatly influences the generated speech, akin to the importance of a prompt in large language models (LLMs). However, current prompt-based TTS models choose the speech prompt manually or simply at random. Hence, in this paper, we adapt retrieval augmented generation (RAG) from LLMs to prompt-based TTS. Unlike traditional RAG methods, we additionally consider contextual information during the retrieval process and present a Context-Aware Contrastive Language-Audio Pre-training (CA-CLAP) model to extract context-aware, style-related features. The objective and subjective evaluations demonstrate that our proposed RAG method outperforms baselines, and our CA-CLAP achieves better results than text-only retrieval methods.

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Jinlong Xue, Yayue Deng, Yingming Gao, and 1 more author

IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2024

Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resources. Furthermore, previous studies in T2I recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, while similar evaluation is lacking in prior TTA works. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion’s superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations. Our implementation and demos are available at https://auffusion.github.io/.

CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis

Yayue Deng, Jinlong Xue, Yukang Jia, and 6 more authors

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. While previous methods have already delved into enhancing context comprehension, context representation still lacks effective representation capabilities and context-sensitive discriminability. In this paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within this framework, we define an innovative pretext task specific to CSS that enables the model to perform self-supervised learning on unlabeled conversational datasets to boost the model’s context understanding. Additionally, we introduce a sampling strategy for negative sample augmentation to enhance context vectors’ discriminability. This is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different contrastive learning strategies and comprehensive experiments in comparison with prior CSS systems. Results demonstrate that the synthesized speech from our proposed method exhibits more contextually appropriate and sensitive prosody.

2023

CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis

Yayue Deng, Jinlong Xue, Yingming Gao, and 1 more author

In Proceedings of the 31st ACM International Conference on Multimedia (MM), 2023

Conversational Speech Synthesis (CSS) aims to produce speech appropriate for oral communication. However, the complexity of context dependency modeling poses significant challenges in the field of CSS, especially the mutual psychological influence between interlocutors. Previous studies have verified that prior commonsense knowledge helps machines understand subtle psychological information (e.g., feelings and intentions) in spontaneous oral dialogues. Therefore, to enhance context understanding and improve the naturalness of synthesized speech, we propose a novel conversational speech synthesis system (CMCU-CSS) that incorporates the Commonsense-based Multi-modal Context Understanding (CMCU) module to model the dynamic emotional interaction among interlocutors. Specifically, we first utilize three implicit states (intent state, internal state, and external state) in CMCU to model inter- and intra-speaker context dependency with the help of commonsense knowledge. Furthermore, we infer emotion vectors from the fusion of these implicit states and multi-modal features to enhance the emotion discriminability of synthesized speech. This is the first attempt to combine commonsense knowledge with conversational speech synthesis, and its effect on the emotion discriminability of synthetic speech is evaluated with an emotion recognition in conversation task. The results of subjective and objective evaluations demonstrate that the CMCU-CSS model achieves more natural speech with context-appropriate emotion and has the best emotion discriminability, surpassing that of other conversational speech synthesis models.

M2-CTTS: End-to-End Multi-Scale Multi-Modal Conversational Text-to-Speech Synthesis

Jinlong Xue, Yayue Deng, Fengping Wang, and 5 more authors

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Conversational text-to-speech (TTS) aims to synthesize a reply with prosody appropriate to the historical conversation. However, comprehensively modeling the conversation remains a challenge, and the majority of conversational TTS systems focus only on extracting global information and omit local prosody features, which contain important fine-grained information such as keywords and emphasis. Moreover, considering textual features alone is insufficient, since acoustic features also contain rich prosody information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system, aiming to comprehensively utilize historical conversation and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody performance and naturalness in CMOS tests.

2022

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Jinlong Xue, Yayue Deng, Yichen Han, and 3 more authors

In 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2022

In recent years, neural-network-based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the speaker encoder models currently used in these methods still cannot capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that achieves better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, which is derived from the speaker verification task, a FastSpeech2-based synthesizer, and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that our proposed method achieves better speaker similarity. To efficiently evaluate our synthesized speech, we are the first to adopt and evaluate different deep-learning-based automatic MOS evaluation methods to assess our results, and these methods show great potential in automatic speech quality assessment.