Unveil the hidden language of emotions with Speech Emotion Recognition (SER) powered by machine learning. SER picks up the nuances in human speech and detects emotional states in real time.
Speech emotion recognition is the use of artificial intelligence to predict human emotions from an audio signal. By analyzing the pitch, tone, and other patterns of a voice, emotions such as anger, happiness, sadness, or frustration can be recognized.
In call centers, for example, SER is used to classify calls based on emotion. This helps determine whether customers are satisfied with the service.
Speech Emotion Recognition: End-to-End Guide

YouTube's auto-generated subtitles, transcripts for online courses, intelligent voice-assisted chatbots, and live speech transcription are some of the widespread applications of basic speech recognition.

Working of Speech Emotion Recognition Model
Speech emotion recognition is a collection of methodologies for isolating, extracting, and classifying speech signals. Audio processing techniques capture a hidden layer of information: they amplify and extract tonal and acoustic features from the speech.
During this transformation, only the pivotal information is retained; the rest of the information contained in the audio is discarded. Sometimes it becomes difficult for a model to classify a sample because it cannot learn the emotion from what remains.

Here are a few steps that define the working of a speech emotion recognition model:
- Feature Extraction: Several parameters can be extracted from the speech signal, for example, pitch, formants, energy, and spectral features such as mel-frequency cepstral coefficients, linear prediction coefficients, and modulation features.
- Feature Selection: Using recursive feature elimination with a linear model, the number of features can be reduced before the final classification stage. This shortens the running time of the algorithms and can also improve accuracy (see the sketch after this list).
- Classification Methods: For discrete emotion classification, many machine learning algorithms can be used. These algorithms are trained on labeled samples and then classify new observations, and the performance of distinct classifiers can be compared.
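Below is a minimal sketch of the feature-selection step with scikit-learn's recursive feature elimination. The feature matrix `x` and the emotion labels `y` are assumed to already exist, and a logistic-regression estimator stands in for the linear model here because the emotion labels are discrete.

```python
# A minimal sketch of recursive feature elimination; `x` (features) and `y`
# (emotion labels) are assumed to come from an earlier extraction step.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=20)
x_reduced = selector.fit_transform(x, y)   # keep only the 20 most informative features
print(selector.support_)                   # boolean mask of the selected features
```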
Apart from that, speech-to-text and direct speech recognition are more complex problems: mapping spoken words and sentences to their textual counterparts is a hard task in itself.
Suggested Read => Popular Text-to-Speech Software
Prominent Features
- Speech signal sampling: This is the process of matching audio data to the expected sampling rate. If the sampling rates in the dataset differ, resampling is needed to make the dataset uniform and consistent.
- Preprocessing: In the preprocessing step, unwanted noise is removed from the speech. This process also normalizes for the length of the vocal tract. A clean signal is obtained by resampling the signal from the speaker.
- Turn Sounds into Numbers: Sound is a vibration that propagates as an acoustic wave. The audio files are first turned into waveforms and then into numbers, which can be fed to a machine learning algorithm.
- Feature extraction: Extracting features plays a vital role in finding and analyzing relationships in the data. By converting the data into an understandable format, the models can learn from it. The primary job of feature extraction is to extract the following five features from the audio signal and fuse them into a single vector (see the sketch after this list):
- Spectral_contrast: Computes the spectral contrast of the signal.
- Chroma_stft: Computes a chromagram from a waveform or power spectrogram.
- Melspectrogram: Computes a mel-scaled power spectrogram.
- Tonnetz: Computes the tonal centroid features.
- MFCC (Mel-Frequency Cepstral Coefficients): The shape of the vocal tract manifests itself in the short-time power spectrum, and the primary job of MFCCs is to accurately represent this envelope.
- Classifier training: After the data is formatted, it can be used for training and testing the models. The data is usually split in an 80:20 ratio: 80% is used for training the model and 20% for testing.
- Output emotion tags: By attaching an emotion tag, it is easy to predict boredom, excitement, amusement, fear, trust, and so on, taking into account factors such as the speaker's gender.
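As mentioned in the feature extraction item above, the five features are fused into a single vector. Here is a minimal sketch of how that fusion might look with librosa; the file path and the exact parameters (such as the number of MFCCs) are placeholder assumptions.

```python
# A minimal sketch: compute the five features named above and fuse them into one vector.
import numpy as np
import librosa

path = 'speech_sample.wav'                      # placeholder path to an audio clip
data, sample_rate = librosa.load(path)
stft = np.abs(librosa.stft(data))               # magnitude spectrogram shared by several features

mfcc = np.mean(librosa.feature.mfcc(y=data, sr=sample_rate, n_mfcc=40).T, axis=0)
chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
mel = np.mean(librosa.feature.melspectrogram(y=data, sr=sample_rate).T, axis=0)
contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(data), sr=sample_rate).T, axis=0)

feature_vector = np.hstack([mfcc, chroma, mel, contrast, tonnetz])   # one fused vector per clip
```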
Data Pre-Processing
Data pre-processing is one of the foundational steps before feature extraction and classification. It should be done effectively, and the speech emotion dataset should be prepared before it. Pre-processing removes unwanted noise from the audio signals.
IEMOCAP, SAVEE, RAVDESS, Emo-DB, and CASIA are the most commonly used datasets. IEMOCAP, SAVEE, and RAVDESS are audio-visual, while Emo-DB and CASIA are speech-only.
The emotion categories used in these datasets include anger, excitement, happiness, frustration, fear, sadness, surprise, and others. Tools like openSMILE can be used for feature extraction and data transformation.
Datasets:
- RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song is free to download and is used for the Python mini project described later. It contains 7,356 files rated for emotional validity, intensity, and genuineness.
- LJ Speech: A dataset of 13,100 audio clips of passages narrated from seven books. The clips vary from 1 to 10 seconds and each is paired with its transcription.
- MS SNSD: It contains clean speech files and a variety of distinct environmental noises, which can be mixed with the clean speech to generate an augmented noisy-speech dataset.
- Speech Accent Archive: A dataset of 2,140 speech samples collected from speakers from 177 countries, covering a diverse range of English accents.
- Flickr 8K Audio Caption Corpus: It contains approximately 40,000 annotated spoken-caption files in WAV format.
- Audio MNIST: It contains 30,000 audio clips of spoken digits (0-9) from 60 distinct speakers.
- TESS (Toronto Emotional Speech Set): A total of 2,800 samples covering a set of phrases spoken with the emotions anger, disgust, fear, happiness, surprise, sadness, and neutral.
- EMOVO Corpus: It contains around 500 samples covering six distinct emotional states.
- CMU Multimodal (CMU-MOSI): A benchmark of 65 hours of annotated audio-video data from more than 1,000 speakers, usable for multimodal sentiment analysis.
Data Augmentation
Not having enough data and having the data unevenly distributed across emotions are two common problems in speech emotion recognition. A data augmentation strategy is used to address both of them.
With data augmentation, the quantity of data can be increased and the distinct emotion classes can be properly balanced. Data augmentation also helps resolve the model over-fitting problem.
The classification performance of models trained on raw data can then be compared with that of models trained on augmented data. Synthetic data can be generated by adding noise, shifting the signal in time, and changing the pitch or speed of the audio, making the model invariant to such perturbations.
#1) Simple Audio
Code that can be used to display and play a simple audio file (assuming the clip is loaded as data, sample_rate and that matplotlib and IPython.display.Audio are available):
plt.figure(figsize=(15, 5))
librosa.display.waveplot(data, sr=sample_rate)
Audio(path)
#2) Noise Injection
Code that can be used for noise injection:
y = noise(data)
plt.figure(figsize=(20, 5))
librosa.display.waveplot(y, sr=sample_rate)
Audio(y, rate=sample_rate)
#3) Stretching
Code that can be used for time stretching:
y = stretch(data)
plt.figure(figsize=(20, 5))
librosa.display.waveplot(y, sr=sample_rate)
Audio(y, rate=sample_rate)
#4) Shifting
Code that can be used for shifting:
y = shift(data)
plt.figure(figsize=(20, 5))
librosa.display.waveplot(y, sr=sample_rate)
Audio(y, rate=sample_rate)
#5) Pitching
Code that can be used for pitch shifting:
y = pitch(data, sample_rate)
plt.figure(figsize=(20, 5))
librosa.display.waveplot(y, sr=sample_rate)
Audio(y, rate=sample_rate)
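The noise(), stretch(), shift(), and pitch() helpers called in the snippets above are not defined in this article. Below is one plausible minimal sketch of such helpers built on numpy and librosa; the function names, parameter order, and default factors are assumptions chosen to match the calls above.

```python
import numpy as np
import librosa

def noise(data, noise_factor=0.035):
    """Inject random white noise scaled to the signal's amplitude."""
    noise_amp = noise_factor * np.random.uniform() * np.amax(data)
    return data + noise_amp * np.random.normal(size=data.shape[0])

def stretch(data, rate=0.8):
    """Slow down (rate < 1) or speed up (rate > 1) the audio without changing pitch."""
    return librosa.effects.time_stretch(y=data, rate=rate)

def shift(data, shift_max=5000):
    """Roll the signal left or right by a random number of samples."""
    shift_range = int(np.random.uniform(low=-shift_max, high=shift_max))
    return np.roll(data, shift_range)

def pitch(data, sampling_rate, pitch_factor=0.7):
    """Raise or lower the pitch by `pitch_factor` semitones without changing tempo."""
    return librosa.effects.pitch_shift(y=data, sr=sampling_rate, n_steps=pitch_factor)
```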
Algorithms used for Speech Emotion Recognition
The Gaussian mixture model, neural networks, recurrent neural networks, support vector machines, and the hidden Markov model are some of the most popular classification algorithms used in speech emotion recognition. A minimal sketch comparing two of them follows.
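The sketch below compares a support vector machine and a neural-network classifier on the same pre-computed features; `x_train`, `x_test`, `y_train`, and `y_test` are assumed to come from an earlier feature-extraction and splitting step, and the hyper-parameters are illustrative.

```python
# A minimal sketch comparing an SVM and an MLP on pre-computed features.
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

for name, clf in [('SVM', SVC(kernel='rbf', C=10)),
                  ('MLP', MLPClassifier(hidden_layer_sizes=(200,), max_iter=500))]:
    clf.fit(x_train, y_train)                                    # train on labeled samples
    print(name, accuracy_score(y_test, clf.predict(x_test)))     # compare test accuracy
```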
Top Five Speech Recognition Packages
- DeepSpeech: Audio files can be transcribed with the pre-trained models, or the models can be fine-tuned on a custom dataset. Based on Baidu's Deep Speech research, it is straightforward to use.
- Wav2letter: An open-source toolkit developed by Facebook that has been merged into the Flashlight library. With this framework, automatic speech recognition models can be trained.
- OpenSeq2Seq: A research project by NVIDIA for sequence-to-sequence problems; automatic speech recognition is one of the key problems it targets.
- TensorFlow ASR: It implements several benchmark models trained using RNNs with CTC.
- SpeechRecognition: Access to many automatic speech recognition engines can be obtained through this package, and it includes wrappers for speech APIs from Google, IBM, and Microsoft Azure (see the sketch after this list).
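As a quick taste of the SpeechRecognition package, the following sketch transcribes a WAV file with the default Google Web Speech API wrapper; the file name is a placeholder.

```python
# A minimal sketch using the SpeechRecognition package; 'sample.wav' is a placeholder path.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile('sample.wav') as source:
    audio = recognizer.record(source)          # read the whole file into memory
print(recognizer.recognize_google(audio))      # send the audio to the Google Web Speech API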
What is Librosa?
Librosa is a Python library for analyzing music and audio. With backward compatibility, standardized names and interfaces, a flatter package layout, and readable code, it is easy to install and use.
The following Python libraries are needed for the dataset-handling code below:
import time
import numpy as np
import librosa
import joblib
import os
With the help of the librosa library, the audio files can be loaded, displayed, and pre-processed to extract the MFCC features.
Extracting the features with the MFCC function:
nd = []
for subdir, dirs, files in os.walk(TESTING_FILES_PATH):
    for file in files:
        try:
            y, sample_rate = librosa.load(os.path.join(subdir, file), res_type='kaiser_fast')
            mfccs = np.mean(librosa.feature.mfcc(y=y, sr=sample_rate, n_mfcc=30).T, axis=0)
            file_class = int(file[8:9]) - 1  # the emotion label is encoded in the file name
            arr = mfccs, file_class
            nd.append(arr)
        except Exception:
            continue
Leveraging the joblib package, the feature and label arrays can be saved:
x, y = zip(*nd)
x, y = np.asarray(x), np.asarray(y)
print(x.shape, y.shape)
joblib.dump(x, os.path.join(SAVE_DIRS_PATH, 'x.joblib'))
joblib.dump(y, os.path.join(SAVE_DIRS_PATH, 'y.joblib'))
What is JupyterLab?
JupyterLab is a web-based, open-source UI that provides file browsers, notebooks, text editors, terminals, rich output, and support for third-party extensions.
To run code in JupyterLab, follow these steps:
- From the command prompt (for example, C:\Users\DataFlair>), run the command jupyter lab.
- A new session will open in the browser.
- Type the code into a notebook or console.
- Press Shift + Enter to execute the code.
Python Mini Project
Using the libraries soundfile, sklearn, and librosa, a model can be built with an MLP classifier that recognizes emotions from sound files. After loading the data and extracting the features, the dataset is split for training and testing.
Steps for the mini Python project:
- Import: Import the libraries.
- Extracting: Using the extract_feature function, the mfcc, chroma, and mel features can be extracted from a sound file.
- mel: mel spectrogram frequency
- chroma: pertains to the 12 different pitch classes
- mfcc: Mel-frequency cepstral coefficients, representing the short-term power spectrum of a sound
- Creating a dictionary: Define a dictionary mapping the numbers used in the dataset to emotion names, plus a list of the emotions to observe.
- Load the data: With the load_data() function, the data can be loaded. Leveraging the glob() function, all the pathnames of the sound files in the dataset can be gathered.
- Classification: Using the emotion dictionary, each number is turned into an emotion, which is then checked against the list of observed emotions.
- Splitting the data: The data needs to be split for training and testing. Only 25% of the dataset is used for testing; the rest is used for training.
- Observe the dataset: Check the shapes of the training and testing sets.
- Feature Extraction: The features are extracted from the dataset.
- Classifier: With the help of the MLP classifier, the classification of the entire dataset can be done, as in the sketch below.
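Here is a hedged end-to-end sketch of the mini project described above, assuming a RAVDESS-style folder layout; the dataset path, the set of observed emotions, and the MLP hyper-parameters are illustrative assumptions.

```python
# A minimal sketch of the mini project: extract features, split 75/25, train an MLP.
import glob
import os
import numpy as np
import librosa
import soundfile
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# RAVDESS encodes the emotion as the third field of the file name.
emotions = {'01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
            '05': 'angry', '06': 'fearful', '07': 'disgust', '08': 'surprised'}
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']

def extract_feature(file_name):
    """Fuse the MFCC, chroma, and mel features of one file into a single vector."""
    with soundfile.SoundFile(file_name) as sound_file:
        audio = sound_file.read(dtype='float32')
        sample_rate = sound_file.samplerate
    if audio.ndim > 1:
        audio = audio.mean(axis=1)   # fold stereo down to mono
    stft = np.abs(librosa.stft(audio))
    mfccs = np.mean(librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=audio, sr=sample_rate).T, axis=0)
    return np.hstack([mfccs, chroma, mel])

def load_data(test_size=0.25):
    x, y = [], []
    # 'DATASET_PATH' is a placeholder for the folder holding the RAVDESS actor directories.
    for file in glob.glob(os.path.join('DATASET_PATH', 'Actor_*', '*.wav')):
        emotion = emotions[os.path.basename(file).split('-')[2]]
        if emotion not in observed_emotions:
            continue
        x.append(extract_feature(file))
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)

x_train, x_test, y_train, y_test = load_data(test_size=0.25)
model = MLPClassifier(alpha=0.01, batch_size=256, hidden_layer_sizes=(300,),
                      learning_rate='adaptive', max_iter=500)
model.fit(x_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(x_test)))
```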
Speech Recognition Project Ideas
- Mood-Based Music Recommendation Engine: Leveraging the SER technique, a model that recognizes mood from real-time speech samples can be developed. Based on the detected emotion, a song is played from a pre-defined list of songs.
- Alexa-Style Conversational Chatbot: An Alexa-style agent listens to whatever you say and takes the necessary actions. You can also design a customized conversational agent that performs command-sensitive tasks such as sending emails or opening a particular piece of software.
- Personalized Voice Recognition System: To validate the speaker, a binary classification is done to verify that the specific person is speaking; after successful validation, the conversational bot continues the session. The SER (Speech Emotion Recognition) model is trained on the speaker's voice samples.
Common models for speech emotion recognition
- Attention-based models: These models are based on deep learning techniques, where attention is focused on the parts of the input that are most relevant. Attention-based models are built on an encoder-decoder architecture, and content-based, location-based, and hybrid attention mechanisms were developed from neural Turing machines.
- Listen, Attend and Spell model: A neural network that transcribes speech utterances into characters. It consists of two components, a listener and a speller: the listener is a recurrent network encoder that takes filter bank spectra as input, while the speller emits the characters as output.
- RNNs/LSTMs: LSTM (long short-term memory) networks are mostly used for classifying, processing, and learning from sequential data. Dependencies in the input are captured through the recurrent state passed from step to step, which enables accurate recognition of speech emotions (see the sketch after this list).
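As a minimal illustration of the RNN/LSTM approach, the following Keras sketch classifies sequences of MFCC frames into emotion classes; the input dimensionality (40 MFCCs per frame) and the eight output classes are assumptions.

```python
# A minimal sketch of an LSTM emotion classifier over sequences of 40-dim MFCC frames.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential([
    LSTM(128, input_shape=(None, 40)),   # variable-length sequences of MFCC frames
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dense(8, activation='softmax'),      # one probability per emotion class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
```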
Applications of SER
The applications of speech emotion recognition are highly useful. For security, lie detection at emergency call centers proves to be highly beneficial. Combined with digital image processing, facial video surveillance, computer vision, and face image processing, speech emotion recognition has a wide range of applications in human-computer interaction.
- Education and Human Resources: In education, it is highly useful for detecting students' level of interest during classroom teaching. It assists the school management team in discovering teaching methodologies that maximize student engagement in the classroom. In HR, emotion assessment can help remove unconscious bias.
- For Public Safety: For risk management, detecting the mood of a crowd, and emergency control, the solution provides real-time information. Suspicious individuals can be flagged through early warnings, and violent or aggressive behavior can be tracked, which assists the police force and other officers in detecting abnormal events that could harm the public.
- Health Prediction: For predicting symptoms of conditions such as depression, cognitive decline, or anxiety, the tool can help detect which stage the condition has reached. AI models are also useful for pre-stroke and epilepsy detection.
Frequently Asked Questions
1. How can we use CNN in speech emotion recognition?
A CNN (Convolutional Neural Network) is used for extracting and reducing the dimensionality of the features. For classifying those features and recognizing emotions from the speech, a combination of error-correcting output codes and a gamma classifier can be used.
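As a minimal illustration (using a plain softmax output rather than the error-correcting-output-codes and gamma-classifier combination mentioned above), a 1D CNN over MFCC frames might look like the following; the input shape and the eight output classes are assumptions.

```python
# A minimal sketch of a 1D CNN over 200 MFCC frames of 40 coefficients each.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense

model = Sequential([
    Conv1D(64, kernel_size=5, activation='relu', input_shape=(200, 40)),
    MaxPooling1D(pool_size=2),
    Conv1D(128, kernel_size=5, activation='relu'),
    GlobalAveragePooling1D(),            # reduce the features before classification
    Dense(8, activation='softmax'),      # one probability per emotion class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```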
2. With the help of an example how can emotion recognition be explained?
With the assistance of artificial intelligence, the emotions of a person can be recognized through facial expressions. There are mainly six different types of emotions that are recognized: anger, surprise, fear, sadness, happiness, and disgust.
3. How big is the RAVDESS dataset?
The Ryerson Audio-Visual Database of Emotional Speech and Song contains a total of 7,356 files of emotional speech and song, recorded by 24 professional actors (12 female and 12 male) in a neutral North American accent.
4. Which libraries are used in speech emotion recognition?
Librosa, sklearn, and soundfile are the libraries used for building the MLP classifier that recognizes emotions from sound files. The SpeechRecognition library acts as a wrapper for popular speech APIs; for example, the Google Web Speech API is the default and is hard-coded into the library.
5. What is the future scope of speech emotion recognition?
To improve overall performance, voice and facial expressions can be used together with body movements. Apart from that, as these tools improve, better communication will be possible between humans and robots.
6. What is the difference between CNN and RNN in speech recognition?
CNNs are used for image classification, medical image analysis, and facial recognition, while RNNs are used for sentiment analysis, machine translation, speech analysis, and natural language processing. In an RNN, the sizes of the input and output can also vary.
Further Reading => Notta Review – Text to Speech Transcription Tool
Conclusion
With the help of speech recognition systems, costs can be reduced; they save time, increase accuracy, and are easy to use. Speech emotion recognition has a vast number of applications in the sectors of security, education, healthcare, medicine, and entertainment.
Apart from this, with the help of emotional AI, the emotional state of a human being can be recognized, and appropriate support and resources can be provided to minimize risk factors. Speech emotion recognition plays a huge role in automating processes, increasing accessibility for people with disabilities, and driving overall productivity.







