JELLY Demo

Abstract

Recently, there has been a growing demand for conversational speech synthesis (CSS) that generates more natural speech by considering the conversational context. To address this, we introduce JELLY, a novel CSS framework that integrates emotion recognition and context reasoning for generating appropriate speech in conversation by fine-tuning a large language model (LLM) with multiple partial LoRA modules. We propose an Emotion-aware Q-former encoder, which enables the LLM to perceive emotions in speech. The encoder is trained to align speech emotions with text, utilizing datasets of emotional speech. The entire model is then fine-tuned with conversational speech data to infer emotional context for generating emotional speech that is natural in conversation. Our experimental results demonstrate that JELLY excels in emotional context modeling, synthesizing speech that naturally aligns with conversation, while mitigating the scarcity of emotional conversational speech datasets.

Official implementation of JELLY: https://github.com/jh-cha-prml/JELLY

Model Architecture

Figure: Overview of JELLY

Experiments

We used the HiFi-GAN vocoder for all audio samples.

Vocoded : Reconstructed from the ground-truth mel-spectrograms.
FastSpeech 2 : Baseline FastSpeech 2 model without context modeling.
GRU-based : Unidirectional GRU-based method with the ground-truth transcript of each utterance in the conversation history.
ECSS : Heterogeneous graph-based method with the emotion label and the ground-truth transcript of each utterance in the conversation history.
JELLY : Our proposed framework, which does not use the emotion label of each utterance in the conversation history.
JELLY (speech_only) : Our proposed framework, which uses neither the emotion label nor the ground-truth transcript of each utterance in the conversation history.
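
The differences between the compared systems come down to which conversation-history annotations each one consumes. The following is a minimal sketch that restates the list above as a lookup table; the class and field names are hypothetical and do not come from the JELLY codebase. Two entries are assumptions read off the descriptions rather than stated outright: that the GRU-based method does not use emotion labels, and that JELLY (without the speech_only suffix) does use the ground-truth transcript. Vocoded is left out because it is reconstructed from ground-truth mel-spectrograms rather than synthesized from context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemInputs:
    """Conversation-history annotations a system consumes (hypothetical fields)."""
    models_context: bool
    uses_transcript: bool
    uses_emotion_label: bool


# Illustrative summary of the system list; not taken from the JELLY codebase.
SYSTEMS = {
    "FastSpeech 2": SystemInputs(False, False, False),
    # Assumption: no emotion label, since only the transcript is mentioned.
    "GRU-based": SystemInputs(True, True, False),
    "ECSS": SystemInputs(True, True, True),
    # Assumption: transcript is used, since only speech_only drops it.
    "JELLY": SystemInputs(True, True, False),
    "JELLY (speech_only)": SystemInputs(True, False, False),
}


def is_annotation_free(name: str) -> bool:
    """True if the system models context without transcripts or emotion labels."""
    s = SYSTEMS[name]
    return s.models_context and not (s.uses_transcript or s.uses_emotion_label)


annotation_free = [name for name in SYSTEMS if is_annotation_free(name)]
print(annotation_free)  # ['JELLY (speech_only)']
```

Laid out this way, JELLY (speech_only) is the only context-aware system in the comparison that needs neither transcripts nor emotion labels for the conversation history.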

Conversation 1

A (neutral) : I am looking for a pan.
B (neutral) : No problem. What size would you like?
A (neutral) : A big one would be nice.
B (neutral) : How about this one? It's our biggest, sixteen inches in diameter.
A (medium happy) : Oh, yes, I like that one, but it's too heavy.
B (neutral) : OK, try this one. It's made of aluminum.

Current turn
A (medium happy) : Oh, yes! This is much better. But it has an aluminum handle.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 2

A (neutral) : Excuse me, but do you have this T-shirt in size L?
B (neutral) : Sorry. We're out of size L's.

Current turn
A (weak fear) : Too bad. I really like it.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 3

A (neutral) : Hi, my name is Lean, and I'm from Russia.
B (neutral) : Nice to meet you, Lean. My name is Alike. I'm from Japan.
A (neutral) : To me English is a difficult language.
B (neutral) : A second language is always difficult.
A (neutral) : True, but English is harder than most. It's a crazy language.

Current turn
A (medium surprise) : A crazy language? Why do you say that?

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 4

A (neutral) : Good evening, may I help you?
B (neutral) : Yes. I would like to book a table for two at seven thirty tomorrow evening.

Current turn
A (weak sad) : I am sorry, there are so many travelers that all our tables have been booked on that day.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 5

A (medium happy) : Beautiful weather, isn't it?
B (weak happy) : Yes, it is. Are you here on business?
A (weak happy) : No, I'm on a vacation to see the famous Three Gorges.
B (weak happy) : I'm going there for a tour, too. Is this your first trip to China?
A (medium happy) : Yes, it is.

Current turn
B (medium happy) : Why don't we go together? I can show you around. I think you'll have a better time.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 6

A (weak happy) : Well, everything is packed and ready to go.
B (medium happy) : It's hard to believe that we're really leaving. The past two weeks was like a dream.
A (medium happy) : Yes. Just think the blue sky, sunshine, mouth-watering food, centuries-old castles.
B (weak happy) : And the people were so friendly!
A (medium happy) : Yeah, we would have been lost without the help of the locals.
B (weak happy) : Do you still remember the small restaurant at the corner of the street?

Current turn
A (medium happy) : Of course. That was the best pasta I've ever had.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 7

A (medium happy) : Did you get a nice tree?
B (medium happy) : Sure did. It's a beauty. Where do you want it?
A (medium happy) : Let's put it over there.

Current turn
A (medium happy) : Let's go to work. We want to have the tree ready to light up by evening.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 8

B (neutral) : Excuse me, when is the next train to Los Angeles?
A (neutral) : Umm, ten-fifteen a.m.
B (neutral) : Can I get the ticket here?

Current turn
A (weak sad) : Sorry. You have to buy your ticket at the next counter.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)

Conversation 9

A (medium happy) : John, do you mind helping me prepare for the picnic?
B (medium happy) : No. Have you checked the weather report?
A (medium happy) : Yes. It says it will be sunny all day. No sign of rain at all.
B (medium happy) : I'd like some toast and chicken wings.

Current turn
A (weak happy) : OK. Please take some fruit salad and crackers for me.

Vocoded	FastSpeech 2	GRU-based
ECSS	JELLY	JELLY (speech_only)