Speech Emotion Detection
Classifier
https://colab.research.google.com/drive/1JQp6gYZqeEQCURBYEiJKhEaDnZUXJLN6#scrollTo=FzDofBARYvcj
30 NARENDRA RAHMAN HANDWI
Why do we need a speech emotion
detection classifier?
● In a work safety system, emotion recognition can provide important
  information about the mental state of workers, which can be used to
  prevent work accidents.
Dataset
● ravdess-emotional-speech-audio
● cremad
● toronto-emotional-speech-set-tess
● surrey-audiovisual-expressed-emotion-savee
● These four datasets can be downloaded for free from the internet.
Importing Libraries
Data Preparation
1. ravdess-emotional-speech-audio dataframe
This Python code walks the
ravdess-emotional-speech-audio
directory and records each audio
file's path and emotion label in a
dataframe.
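The slides do not reproduce the notebook code, so the following is only a minimal sketch of how the label could be read from a file name. It assumes the standard RAVDESS naming scheme, in which the third hyphen-separated field is the emotion code. The function name `ravdess_label`, the label spellings, and the example path are hypothetical.

```python
from pathlib import PurePosixPath

# RAVDESS emotion codes (third field of the file name).
# The label spellings on the right are an assumption for illustration.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fear", "07": "disgust", "08": "surprise",
}

def ravdess_label(path):
    """Return the emotion label encoded in a RAVDESS file name."""
    stem = PurePosixPath(path).stem      # e.g. "03-01-06-01-02-01-12"
    emotion_code = stem.split("-")[2]    # third field is the emotion
    return RAVDESS_EMOTIONS[emotion_code]

# Build (label, path) rows that could later be loaded into a dataframe.
example_paths = ["Actor_12/03-01-06-01-02-01-12.wav"]  # hypothetical path
rows = [(ravdess_label(p), p) for p in example_paths]
```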
2. cremad dataframe
This Python code walks the cremad
directory and records each audio
file's path and emotion label in a
dataframe.
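As a hedged sketch of the same idea for CREMA-D: its file names are underscore-separated, with the third field holding a three-letter emotion code. The function name `crema_label`, the label spellings, and the example path are hypothetical and may differ from the notebook.

```python
from pathlib import PurePosixPath

# CREMA-D emotion codes (third underscore-separated field).
CREMA_EMOTIONS = {
    "ANG": "angry", "DIS": "disgust", "FEA": "fear",
    "HAP": "happy", "NEU": "neutral", "SAD": "sad",
}

def crema_label(path):
    """Return the emotion label encoded in a CREMA-D file name."""
    stem = PurePosixPath(path).stem       # e.g. "1001_DFA_ANG_XX"
    return CREMA_EMOTIONS[stem.split("_")[2]]
```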
3. toronto-emotional-speech-set-tess dataset
This Python code walks the
toronto-emotional-speech-set-tess
directory and records each audio file's
path and emotion label in a dataframe.
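A minimal sketch for TESS, assuming its usual file naming in which the last underscore-separated field is the emotion and "ps" stands for pleasant surprise. The function name `tess_label` and the example file names are hypothetical.

```python
from pathlib import PurePosixPath

def tess_label(path):
    """Return the emotion label encoded in a TESS file name."""
    stem = PurePosixPath(path).stem           # e.g. "OAF_back_angry"
    emotion = stem.split("_")[-1].lower()     # last field is the emotion
    # "ps" is the TESS abbreviation for pleasant surprise.
    return "surprise" if emotion == "ps" else emotion
```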
4. surrey-audiovisual-expressed-emotion-savee dataset
The audio files in this dataset are named so
that the prefix letters describe the emotion
class, as follows:
'a' = 'anger'      'd' = 'disgust'   'f' = 'fear'
'h' = 'happiness'  'n' = 'neutral'
'sa' = 'sadness'   'su' = 'surprise'
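The prefix mapping above can be sketched in code as follows. This assumes file names such as "DC_a01.wav", where an optional speaker code and underscore precede the emotion prefix and a clip number. That layout, the function name `savee_label`, and the example names are assumptions.

```python
import re
from pathlib import PurePosixPath

# Prefix-to-emotion mapping taken from the list above.
SAVEE_EMOTIONS = {
    "a": "anger", "d": "disgust", "f": "fear", "h": "happiness",
    "n": "neutral", "sa": "sadness", "su": "surprise",
}

def savee_label(path):
    """Return the emotion label encoded in a SAVEE file name."""
    stem = PurePosixPath(path).stem               # e.g. "DC_sa03"
    token = stem.split("_")[-1]                   # e.g. "sa03"
    prefix = re.match(r"[a-z]+", token).group(0)  # letters before the digits
    return SAVEE_EMOTIONS[prefix]
```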
Data Visualisation and Exploration
First, let's plot the count of each
emotion in our dataset.
This generates a count plot with the
Seaborn library to visualize the
distribution of the different emotions
in the dataset.
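A minimal sketch of this step, assuming the labels are available as a plain list of strings. The function names and the plot title are hypothetical. The counting part uses only the standard library, and the plotting function imports Seaborn lazily so the counts can be inspected without it.

```python
from collections import Counter

def emotion_counts(labels):
    """Count how many samples carry each emotion label."""
    return dict(Counter(labels))

def plot_emotion_counts(labels):
    """Draw a Seaborn count plot of the labels (requires seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.countplot(x=list(labels))
    ax.set_title("Count of Emotions")   # hypothetical title
    ax.set_xlabel("Emotions")
    ax.set_ylabel("Count")
    plt.show()
    return ax

counts = emotion_counts(["fear", "angry", "fear", "sad"])
```

Strongly unequal counts would be worth knowing before training, since a classifier can favour the most frequent classes.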
Plot Waveplots and Spectrograms for Audio Signals (Fear)
A waveplot shows the loudness
(amplitude) of the audio at a given
time. A spectrogram is a visual
representation of the spectrum of
frequencies of a sound or other
signal as it varies with time.
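To make the idea of a spectrogram concrete, here is a minimal NumPy sketch that computes one with a short-time Fourier transform. It is an illustration only, not the notebook's plotting code. The function name `spectrogram_db`, the window and hop sizes, and the synthetic 1000 Hz test tone are all assumptions.

```python
import numpy as np

def spectrogram_db(signal, n_fft=512, hop=128):
    """Return a magnitude spectrogram in dB, shaped (frequency bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20 * np.log10(magnitude.T + 1e-10)  # small offset avoids log(0)

# Synthetic one-second 1000 Hz tone sampled at 8000 Hz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram_db(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 512   # convert bin index to frequency
```

Each column of `spec` is the spectrum of one short time slice, so plotting it as an image gives time on the horizontal axis and frequency on the vertical axis.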
Plot Waveplots and Spectrograms for Audio Signals (Angry)
Plot Waveplots and Spectrograms for Audio Signals (Sad)
Plot Waveplots and Spectrograms for Audio Signals (Happy)
Data Augmentation
Data augmentation is the process of creating new
synthetic data samples by adding small
perturbations to the initial training set. To generate
synthetic data for audio, we can apply noise
injection, time shifting, and changes of pitch and
speed. The objective is to make the model invariant
to those perturbations and enhance its ability to
generalize. For this to work, a perturbed sample
must keep the same label as the original training
sample. For image data, augmentation can be
performed by shifting, zooming, or rotating the
image. First, let's check which augmentation
techniques work better for our dataset.
1. Simple Audio
Plots the waveform of an audio signal and
allows for interactive playback of the
corresponding audio file. This can be useful
for visualizing the structure of an audio signal
or for checking the quality of an audio file.
2. Noise Injection
Adds noise to an audio signal, plots its
waveform, and allows for interactive playback
of the resulting noisy signal. This can be
useful for simulating real-world audio
scenarios or testing the robustness of audio
processing algorithms to noise.
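A minimal sketch of noise injection with NumPy, assuming the audio is a one-dimensional float array. The function name `add_noise`, the default `noise_factor`, and the `seed` parameter are hypothetical choices for illustration.

```python
import numpy as np

def add_noise(data, noise_factor=0.035, seed=None):
    """Add Gaussian noise scaled relative to the signal's peak amplitude."""
    rng = np.random.default_rng(seed)
    # Random amplitude between 0 and noise_factor * peak.
    noise_amp = noise_factor * rng.uniform() * np.max(np.abs(data))
    return data + noise_amp * rng.normal(size=data.shape)

clean = np.sin(np.linspace(0, 2 * np.pi, 1000))
noisy = add_noise(clean, seed=0)
```

Keeping the noise small relative to the peak amplitude is what lets the augmented sample keep the same emotion label as the original.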
3. Stretching
Applies time stretching to an
audio signal, plots its waveform,
and allows for interactive
playback of the resulting signal.
4. Shifting
Shifts an audio signal by a
certain number of frames, plots
its waveform, and allows for
interactive playback of the
resulting shifted signal. This can
be useful for modifying the timing
of an audio signal or for aligning
multiple audio signals together.
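A minimal sketch of shifting with NumPy, assuming a circular shift in which samples pushed past the end wrap around to the start. The function name `shift` is hypothetical, and the notebook may handle the edges differently.

```python
import numpy as np

def shift(data, n_samples):
    """Circularly shift the signal; positive values delay it."""
    return np.roll(data, n_samples)

shifted = shift(np.array([1, 2, 3, 4, 5]), 2)
```

In this example the last two samples wrap to the front, giving `[4, 5, 1, 2, 3]`. The signal keeps its length and its speed; only its position in time changes.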
5. Pitch
Pitch is a useful feature in
voice recognition systems
because it can help distinguish
between different speakers,
identify emotional state, and aid
speech recognition.
Feature Extraction
Feature extraction is a very
important part of analyzing data
and finding relations within it.
Models cannot work with raw
audio data directly, so we need
to convert it into a format they
can use. Feature extraction does
this conversion.
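The slides do not say which features the notebook extracts. As an illustration only, the sketch below computes two common audio features with NumPy: the zero-crossing rate and the root-mean-square energy. The choice of features and all function names are assumptions.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))

def rms_energy(signal):
    """Root-mean-square amplitude of the signal."""
    return float(np.sqrt(np.mean(np.square(signal))))

def extract_features(signal):
    """Turn a variable-length signal into a fixed-length feature vector."""
    return np.array([zero_crossing_rate(signal), rms_energy(signal)])

features = extract_features(np.array([1.0, -1.0, 1.0, -1.0]))
```

Whatever the features are, the point is the same: every audio file, regardless of its length, becomes a numeric vector of fixed size that a model can take as input.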
Feature Extraction (Cont.)
Collects features
and labels from
multiple audio
files into a table,
so the data can be
manipulated and
analyzed in
tabular form.
Data Preparation
Modelling
Creates a convolutional neural
network (CNN) using the Keras
API for TensorFlow, and
compiles it for training with the
Adam optimizer and categorical
cross-entropy loss.
Train a Neural Network Model
Trains the neural network
model with the
ReduceLROnPlateau
callback, which lowers the
learning rate when a
monitored metric stops
improving. This can help
the model converge better
in the later stages of
training.
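To show the rule behind ReduceLROnPlateau, here is a simplified plain-Python simulation. It is not the Keras implementation: it ignores the callback's cooldown and minimum-delta options, and the function name, default values, and example losses are all hypothetical.

```python
def reduce_lr_on_plateau(val_losses, lr=1e-3, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the validation loss has failed
    to improve for `patience` consecutive epochs, never going below `min_lr`.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss        # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

schedule = reduce_lr_on_plateau([1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.72])
```

With these example losses the rate is halved after the fourth epoch and again after the seventh, because each of those epochs is the second in a row without a new best loss.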
Displays Model Accuracy on Test Data and Displays Graphs
of Loss and Accuracy in The Model Training Process
Predicts the labels of the test data with the trained model, and then
converts the predicted labels and the actual labels back to their
original categorical form.
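A minimal sketch of the conversion back to categorical labels, assuming the model outputs one probability per class and the true labels are one-hot encoded in the same class order. The function name `decode_labels`, the class names, and the example values are hypothetical.

```python
import numpy as np

def decode_labels(encoded, class_names):
    """Map each row of probabilities or one-hot values to its class name."""
    indices = np.argmax(encoded, axis=1)   # position of the largest value
    return [class_names[i] for i in indices]

class_names = ["angry", "fear", "sad"]                   # hypothetical order
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])     # model output
one_hot = np.array([[1, 0, 0], [0, 1, 0]])               # true labels

predicted = decode_labels(probs, class_names)
actual = decode_labels(one_hot, class_names)
```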
Creating a Confusion Matrix and Then Visualizing
It Using a Heatmap
A confusion matrix is a
table that is often used to
evaluate the performance
of a classification model.
Each cell counts how often
an actual class was
predicted as a given class,
so it shows the true
positives, false positives,
and false negatives for
each class.
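A minimal NumPy sketch of building such a matrix, with rows for the actual class and columns for the predicted class. The function name, class names, and example labels are hypothetical. Drawing the heatmap is left out.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, class_names):
    """Count (actual, predicted) pairs; rows are actual, columns predicted."""
    index = {name: i for i, name in enumerate(class_names)}
    matrix = np.zeros((len(class_names), len(class_names)), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual], index[predicted]] += 1
    return matrix

y_true = ["angry", "angry", "sad", "sad"]
y_pred = ["angry", "sad", "sad", "sad"]
cm = confusion_matrix(y_true, y_pred, ["angry", "sad"])

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()
```

Here one "angry" clip is misclassified as "sad", which appears as the off-diagonal entry in the first row.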
CONCLUSION
Our model is most accurate at
predicting the surprise and
angry emotions. This makes
sense, because audio files of
these emotions differ from the
others in many ways, such as
pitch and speed.
Overall we achieved 70%
accuracy on our test data. That
is decent, but we can improve it
further by applying more
augmentation techniques and
using other feature extraction
methods.