Jun-Hyeok Cha, Seung-Bin Kim, Hyung-Seok Oh, and Seong-Whan Lee
Recently, there has been a growing demand for conversational speech synthesis (CSS) that generates more natural speech by considering the conversational context. To address this, we introduce JELLY, a novel CSS framework that integrates emotion recognition and context reasoning for generating appropriate speech in conversation by fine-tuning a large language model (LLM) with multiple partial LoRA modules. We propose an Emotion-aware Q-former encoder, which enables the LLM to perceive emotions in speech. The encoder is trained to align speech emotions with text, utilizing datasets of emotional speech. The entire model is then fine-tuned with conversational speech data to infer emotional context for generating emotional speech that is natural in conversation. Our experimental results demonstrate that JELLY excels in emotional context modeling, synthesizing speech that naturally aligns with conversation, while mitigating the scarcity of emotional conversational speech datasets.
Official implementation of JELLY: https://github.com/jh-cha-prml/JELLY
The overview of JELLY
We used the HiFi-GAN vocoder for all audio samples.
Vocoded : Reconstructed from the ground-truth mel-spectrograms.
FastSpeech 2 : Baseline FastSpeech 2 model without context modeling.
GRU-based : Unidirectional GRU-based method with the ground-truth transcript of each utterance in the conversation history.
ECSS : Heterogeneous graph-based method with the emotion label and the ground-truth transcript of each utterance in the conversation history.
JELLY : Our proposed framework without the emotion label of each utterance in the conversation history.
JELLY (speech_only) : Our proposed framework without the emotion label and ground-truth transcript of each utterance in the conversation history.
Audio samples: nine sets, each comparing Vocoded | FastSpeech 2 | GRU-based | ECSS | JELLY | JELLY (speech_only)