DL model to predict emotion behind a spoken sentence (Sentiment Analysis!)

In our previous blog (Part 1 of the Speech Emotion Recognition blog series), we explained the use case of sentiment analysis using speech and walked you through some of the real-world and machine learning problem statements. This is the second part of the series, and the following are the topics we will cover in this part.

1. First Cut Solution

2. Data Loading and Preprocessing

3. Exploratory Data Analysis

4. Feature Engineering

5. Modeling Preparation and Modeling

6. Results

7. Future Work

8. Profiles for Future Connect

9. References


Let’s get started.

1. First Cut Solution

a) Feature engineering:

I first need to fetch data from the different sources, combine them, check whether the classes are balanced, and remove NaN values if there are any. Then I'd go for audio feature extraction: since this is a speech emotion recognition system built using only audio, I'd start with MFCC and mel spectrogram features before moving on to other features.

b) Exploratory Data Analysis:

I'd observe which features are the most useful and also check the dataset for biases.

c) Data Normalization and standardization:

I am going for a DL model, so scaling the data (both normalization and standardization) becomes necessary.

d) Training and Testing the Model

We then go for training and testing, most likely splitting the data into 80% for training and 20% for testing.

e) Deep learning model

I'd go for a 1D CNN as the baseline model for my case study and compare it with an RNN (LSTM) model to observe the results. I'll use a softmax classifier and a cross-entropy loss function for the classification.

f) Model evaluation

Error is the deviation of the values predicted by the model from the true values.

Below are the types of metrics we will be calculating for our deep classification model:

  • Confusion Matrix
  • F1 score
  • Precision
  • Recall
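
As a minimal sketch of how these metrics relate, the snippet below computes precision, recall, and F1 for a single class from raw counts. The toy labels and the helper name `prf_for_class` are illustrative assumptions and are not part of the case study's code.

```python
def prf_for_class(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical), only to exercise the helper
y_true = ["sad", "sad", "calm", "happy", "calm", "sad"]
y_pred = ["sad", "calm", "calm", "happy", "sad", "sad"]
precision, recall, f1 = prf_for_class(y_true, y_pred, "sad")
print(precision, recall, f1)
```

The confusion matrix holds the same counts for every pair of true and predicted classes, so all three scores can be read off it.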

So overall, these are the tentative steps that I’ll take to come up with the final classification for sentiment analysis.

Now, let's roll up our sleeves and jump into the code.

2. Data Loading and Preprocessing

First, we load the data from the RAVDESS and TESS ZIP files. This is the code snippet to see what our label distribution looks like:

import matplotlib.pyplot as plt

# labels: dict mapping each emotion to its number of clips
plt.figure(figsize=(7, 5))
plt.bar(labels.keys(), labels.values())
plt.title('labels data distribution')
plt.ylabel('emotions frequency')
plt.show()


3. Exploratory Data Analysis

Analysis for Emotions (class label):

  1. The calm emotion has the lowest count, 376.
  2. The happy, sad, angry, and fearful emotions have the highest count, 776 each.
  3. The neutral, disgust, and surprised emotions fall in between, at approximately 590 each.

So, overall there's a slight imbalance in the dataset, but not a huge one, and we can move forward with feature extraction.
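
One way to put a number on the imbalance is to compare class shares and the ratio between the largest and smallest class. This is a small sketch, assuming the counts quoted above; the three "approximately 590" classes are rounded, so the total only approximates the real dataset size.

```python
# Per-class counts as reported in the analysis above (the 590s are approximate)
counts = {
    "calm": 376,
    "happy": 776, "sad": 776, "angry": 776, "fearful": 776,
    "neutral": 590, "disgust": 590, "surprised": 590,
}

total = sum(counts.values())
shares = {emotion: n / total for emotion, n in counts.items()}

# Ratio between the largest and the smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())

for emotion, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{emotion:10s} {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```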

Analysis for Audio Length:-

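import matplotlib.pyplot as plt
import seaborn as sns

# tess_aud_len: list with the duration of every TESS clip
ax = sns.boxplot(x=tess_aud_len)
plt.title('Tess data Audio length')
plt.xlabel('audio duration')
plt.show()

Similarly, we do it for the RAVDESS dataset.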


1. TESS audio length: min 1.25, max 2.90, IQR 1.75-2.25, median 2.10

2. RAVDESS song audio length: min 3.70, max 5.10, IQR 4.35-4.75, median 4.60

3. RAVDESS speech audio length: min 2.90, max 4.95, IQR 2.95-3.80, median 3.70
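
The same numbers can be computed directly instead of being read off the box plot. The sketch below assumes a plain list of clip durations (these could come from something like `librosa.get_duration`); the helper name `summarize_durations` and the sample values are hypothetical.

```python
import numpy as np

def summarize_durations(durations):
    """Return min, quartiles, median, and max for a list of clip lengths."""
    d = np.asarray(durations, dtype=float)
    q1, median, q3 = np.percentile(d, [25, 50, 75])
    return {"min": d.min(), "q1": q1, "median": median, "q3": q3, "max": d.max()}

# Hypothetical clip lengths, only to exercise the helper
example = [1.0, 2.0, 3.0, 4.0, 5.0]
stats = summarize_durations(example)
print(stats)
```

The IQR quoted in the list above corresponds to the `q1` to `q3` range.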

4. Feature Engineering

We'll use the librosa library for feature extraction. As per the research, the most used and important features for sound analysis are MFCC and the mel spectrogram.

Following is the code snippet to extract MEL and MFCC features:

import os
import numpy as np
import librosa
from tqdm import tqdm

tess_dir = '/content/TESS Toronto emotional speech set data'
mel_tess = []
mfcc_tess = []

# Walk every sub-folder of the TESS directory and load each audio file
for tess in tqdm(os.listdir(tess_dir)):
    for i in os.listdir(os.path.join(tess_dir, tess)):
        y, sr = librosa.load(os.path.join(tess_dir, tess, i))
        # Average over the time axis so every clip yields one fixed-length vector
        mel_tess.append(np.mean(librosa.feature.melspectrogram(y=y, sr=sr).T, axis=0))
        mfcc_tess.append(np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T, axis=0))
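
To see why every clip becomes a fixed-length row regardless of its duration, here is a small NumPy-only sketch. The random arrays are stand-ins for the librosa outputs (128 is librosa's default number of mel bands), the frame count of 87 is arbitrary, and the concatenation order is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for librosa outputs: rows are features, columns are time frames
mel = rng.random((128, 87))   # 128 mel bands, 87 frames (arbitrary)
mfcc = rng.random((13, 87))   # n_mfcc=13, same number of frames

# Transpose to (frames, features), then average over the frame axis
mel_vec = np.mean(mel.T, axis=0)
mfcc_vec = np.mean(mfcc.T, axis=0)

# One row of the dataset: 128 + 13 = 141 feature columns
row = np.concatenate([mel_vec, mfcc_vec])
print(mel_vec.shape, mfcc_vec.shape, row.shape)
```

Averaging over frames discards the temporal order of the audio, which is the price paid for getting one vector per clip.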



The final dataset has 5252 rows and a total of 141 feature columns, and we save it to a CSV file:

df_new.to_csv('/content/drive/MyDrive/Self Case Study/data.csv')

Finally, this is how data looks after adding MFCC and MEL features:

5. Modeling Preparation and Modeling

5.1 Random Splitting

We split the data in an 80:20 ratio. To convert the emotion categories to numbers we use scikit-learn's LabelEncoder, and we dump the fitted encoder to a pickle file:

import pickle

# lb: the fitted LabelEncoder
with open('/content/drive/MyDrive/Self Case Study/label_encoder.pkl', 'wb') as f:
    pickle.dump(lb, f)
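
The split and encoding steps might look roughly like the sketch below. The toy arrays are hypothetical stand-ins for the real features and labels, and `stratify=y` and `random_state=42` are additions of this sketch; the post only states that the split is random and 80:20.

```python
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins (hypothetical): 40 clips, 141 features, 4 emotions
rng = np.random.default_rng(0)
X = rng.random((40, 141))
emotions = np.array(["angry", "calm", "happy", "sad"] * 10)

lb = LabelEncoder()
y = lb.fit_transform(emotions)  # strings -> integers 0..3

# 80:20 split; stratify keeps the class proportions equal in both parts
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Round-trip the encoder through pickle, as one would at inference time
restored = pickle.loads(pickle.dumps(lb))
print(x_train.shape, x_test.shape)
print(restored.inverse_transform([0, 3]))
```

Saving the encoder matters because the integer-to-emotion mapping is needed again whenever predictions are turned back into labels.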

5.2 Minmax scaling

We apply min-max scaling to bring every feature into the range between 0 and 1. The transformation is given by:

x_scaled = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum of that feature in the training data. In code:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

x_train = scaler.fit_transform(x_train)

x_test = scaler.transform(x_test)
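
To make the arithmetic behind `fit_transform` and `transform` concrete, here is a NumPy-only sketch of min-max scaling, which maps each feature column to (x - min) / (max - min). The tiny arrays are made up for illustration.

```python
import numpy as np

x_train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
x_test = np.array([[2.0, 50.0]])

# "fit": per-column minimum and maximum come from the training data only
col_min = x_train.min(axis=0)
col_max = x_train.max(axis=0)

# "transform": (x - min) / (max - min), applied column by column
x_train_scaled = (x_train - col_min) / (col_max - col_min)
x_test_scaled = (x_test - col_min) / (col_max - col_min)

print(x_train_scaled)
print(x_test_scaled)  # a test value above the training max lands above 1
```

Fitting on the training split alone keeps information about the test set out of the preprocessing, at the cost that test values can fall slightly outside the 0 to 1 range.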

5.3 Modelling

After scaling the dataset, let's jump to one of the most important sections of the case study, i.e., modeling.

The process that I have used here for modeling is:

1. We create a random model to set the worst-case performance bar.

2. Then we create our baseline model; in this case, I have built a baseline 1D CNN model.

3. Then we create an LSTM model and compare its results with the baseline.

Let's begin:

5.3.1 Creating a random model

The random model returns a randomly chosen label from y_train for each data point.

import random

# creating a random model
def random_model():
    return random.choice(y_train)

random model f1_score: 0.120
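
A sketch of how such a random baseline could be scored is shown below. The balanced toy labels and the macro averaging are assumptions of this sketch, since the post does not say which F1 averaging it uses. With eight classes, random guessing would be expected to land at roughly 1/8 = 0.125, which is in line with the 0.120 reported above.

```python
import random
from sklearn.metrics import f1_score

random.seed(0)

# Toy stand-in (hypothetical): 8 emotion classes, perfectly balanced
y_train = [c for c in range(8) for _ in range(100)]
y_test = [c for c in range(8) for _ in range(25)]

def random_model():
    # Ignores the input entirely and returns a label drawn from y_train
    return random.choice(y_train)

y_pred = [random_model() for _ in y_test]
score = f1_score(y_test, y_pred, average="macro")
print(f"random model macro f1: {score:.3f}")
```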

5.3.2 Creating our baseline model – 1D CNN:

We choose a 1D CNN because:

· As per the research, it can perform even better than RNN models on classification tasks.

· It is efficient, which makes it a good fit for text and numerical data.

Following is the code snippet for the 1D CNN architecture:-


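from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Dense, Flatten

model = Sequential()
model.add(Conv1D(64, kernel_size=8, activation='relu', input_shape=(x_train.shape[1], 1)))
model.add(Conv1D(128, kernel_size=8, activation='relu'))
model.add(Conv1D(128, kernel_size=8, activation='relu'))
# Assumption: this snippet omits the layers between the blocks shown here.
# A Flatten is added because Dense needs a 2D input to give one prediction per clip.
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(8, activation='softmax'))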
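
Two practical details of feeding tabular features to Conv1D can be checked without TensorFlow: the input needs a trailing channel axis, and each convolution with the default 'valid' padding shortens the sequence. This is a rough sketch; the helper `conv1d_valid_length` is hypothetical, and the length calculation ignores any layers that sit between the three Conv1D layers in the full model.

```python
import numpy as np

# Toy feature matrix (hypothetical): 4 clips with 141 features each
x_train = np.zeros((4, 141))

# Conv1D expects (samples, steps, channels); add a channel axis of size 1
x_train_3d = np.expand_dims(x_train, axis=-1)
print(x_train_3d.shape)

def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

# Three stacked Conv1D layers with kernel_size=8
length = x_train_3d.shape[1]
for _ in range(3):
    length = conv1d_valid_length(length, 8)
print(length)
```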


1D CNN Model f1_score: the f1_score for our baseline 1D CNN model is 0.80, which is a lot better than the random model's f1_score of 0.12.

Following is the precision plot on each epoch for 1D CNN Model:

The best model is saved as best_model.h5.

5.3.3 LSTM Model Architecture

Now, let’s try LSTM as our data is sequence data.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model

input_array = keras.Input(shape=(x_train.shape[1], 1), name='input_array')
lstm_layer = LSTM(512, name='lstm_layer')(input_array)
dense_1 = Dense(256, activation='relu', kernel_initializer='he_normal', name='dense_1')(lstm_layer)
dropout_1 = Dropout(rate=0.2, name='dropout_1')(dense_1)
dense_2 = Dense(128, activation='relu', kernel_initializer='he_normal', name='dense_2')(dropout_1)
bn = BatchNormalization()(dense_2)
output_layer = Dense(8, activation='softmax', name='output_layer')(bn)

model_lstm = Model(inputs=input_array, outputs=output_layer)
# learning_rate replaces the deprecated lr argument
model_lstm.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), metrics=[tf.keras.metrics.Precision()])
# all is assumed to be the list of Keras callbacks built elsewhere in the full code (not shown here)
history_lstm = model_lstm.fit(x_train, y_train, batch_size=32, epochs=200, validation_data=(x_test, y_test), callbacks=all)

LSTM Model f1_score : 0.735
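
Because the model is compiled with categorical_crossentropy and ends in an 8-way softmax, the targets need to be one-hot vectors and the predictions come back as one probability per class. The NumPy sketch below shows both conversions with made-up values; Keras' `to_categorical` utility performs the same one-hot step.

```python
import numpy as np

num_classes = 8
y_int = np.array([0, 3, 7, 3])  # integer labels, e.g. from a LabelEncoder

# One-hot encode: row i has a 1 in column y_int[i]
y_onehot = np.eye(num_classes)[y_int]

# Hypothetical softmax outputs for two clips (each row sums to 1)
probs = np.array([
    [0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.10, 0.05],
    [0.70, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03],
])

# The predicted class is the index of the largest probability
pred = probs.argmax(axis=1)
print(y_onehot.shape, pred)
```

The integer predictions obtained this way are what a report such as classification_report compares against the true labels.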

Following is the precision plot for LSTM Model:

6. Results

Our baseline 1D CNN model did not improve much after 30 epochs, but it had already reached an f1_score of 0.80, whereas our LSTM model could not cross an f1_score of 0.735 even after 200 epochs.

So clearly, the 1D CNN is not only faster than the LSTM here but also gives a much better f1_score.

We use scikit-learn's classification_report function for error analysis, and we do it for both the LSTM and CNN models.

from sklearn.metrics import classification_report

print(classification_report(actual, prediction_lstm, target_names = classes))

Let's check out the 1D CNN model's confusion matrix for better visualization:

Now, let's compare this with the LSTM confusion matrix:

Conclusion: Comparing the two reports from the CNN and the LSTM, the CNN performs much better on our KPIs of precision, recall, and f1_score for almost all the classes, which can also be observed from the above confusion matrices.

Error Analysis: When we dig deep into the CNN model for error analysis, the major misclassifications happen in the sad (15), calm (14), and happy (11) sentiments, mainly because of tone and pitch. Normally the pitch for sad is low, and the same is the case for calm, hence the misclassification between them. Similarly, happy and angry both have high tones.
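
Counts like these can be pulled out of a confusion matrix programmatically. The sketch below uses a small hypothetical matrix, not the case study's actual results, with rows as true classes and columns as predictions (the layout scikit-learn uses).

```python
import numpy as np

classes = ["calm", "happy", "sad"]
# Hypothetical confusion matrix: rows are true classes, columns are predictions
cm = np.array([
    [50,  1,  9],
    [ 2, 55,  3],
    [12,  2, 46],
])

# Misclassified clips per true class = row total minus the diagonal entry
errors_per_class = cm.sum(axis=1) - np.diag(cm)

# Zero the diagonal and find the largest single confusion
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(dict(zip(classes, errors_per_class.tolist())))
print(f"most common confusion: true {classes[i]} predicted as {classes[j]}")
```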

7. Future Work

We can try using spectrogram images for sentiment analysis using 2D CNN model architectures.

Kindly visit my GitHub profile for the full code, where I have built a .py class for real-time emotion classification of a query.

8. Profiles

GitHub link: https://github.com/TarunLohchab/Speech_Emotion_Recognition

To connect with me:

LinkedIn link:- https://www.linkedin.com/in/tarun-lohchab-3080b41bb/

9. References

Hope this blog helped you gain a deeper understanding of some of the efficient speech emotion recognition solutions. We at Akaike technologies keep taking a deep dive into different problems in various fields which helps us customize the right solution for our clients. Keep visiting our website for more case studies, blogs, use cases, and other exciting work.