Recognition of Emotions in Oral Speech Using a Neural Network Approach
Abstract
Human oral speech always carries an emotional coloring, because our emotions and mood affect how we speak. When we are glad, excited, sad or angry, this is reflected in the voice, tempo and intonation. It is impossible to speak entirely without emotions: they are an integral part of the personality and accompany us everywhere, and speech becomes richer and more expressive when feelings are conveyed through words. Assessing a person's emotional state from speech can benefit many areas of activity, such as medicine, psychology, criminology, marketing and education. In medicine, emotion assessment from speech can help in diagnosing and treating mental disorders, in monitoring a patient's emotional state, and in detecting diseases such as Alzheimer's at early stages. In psychology, this method can be useful for studying emotional reactions to various stimuli and situations. In criminology, speech analysis and emotion detection can be used to identify false testimony and deception. In marketing and advertising, it can help to understand the audience's reaction to a product or an advertising campaign. In education, emotion assessment from speech can be used to analyze the emotional state of students and to optimize the educational process. Thus, automating the process of emotion recognition is a promising research direction, and the use of machine learning methods and pattern recognition algorithms can make this process more accurate and efficient.
As a tool for recognizing paralinguistic phenomena, namely emotions, in human oral speech, a neural network approach is proposed, which has proven effective for problems where an exact solution is difficult to find. The paper presents a convolutional neural network that recognizes four human emotions (sadness, joy, anger, neutral) from oral speech. Particular attention is paid to forming a dataset for training and testing the model, since there are currently almost no open speech databases for studying paralinguistic phenomena (especially in Russian). This study uses the Dusha emotional speech dataset.
Mel-spectrograms of the speech signal are used as features for emotion recognition, which made it possible to increase the recognition rate and the processing speed of the neural network compared to low-level descriptors.
Experimental results on the test set show that the presented neural network recognizes human emotions from oral speech in 75% of cases, which is a high result.
Further research will involve training and, if necessary, modifying the presented neural network to recognize paralinguistic phenomena not covered in this study, such as deception, fatigue, depression and others.
Keywords: speech data, speech databases, neural networks, convolutional neural networks, emotion recognition, classification, classification methods
Introduction
The investigation of emotional manifestations of oral speech is one of the most complicated problems of the modern humanities – not only linguistics itself, but also neurolinguistics, psycholinguistics and, finally, cognitive science. This is due to several reasons: 1) the difficulty of attributing a specific emotion, since it is not always a clearly expressed mental phenomenon; 2) the integrated nature of its transmission by paralinguistic means – through voice characteristics (lamprophony and pitch), speech rate, timbre, pausation and accentuation; 3) the multiplicity of methods and approaches to the study of emotionality, due to the above-mentioned "borderline" nature of the object of study (Balabanova, Abramov, 2023; Velichko et al., 2022; Santos et al., 2021; Chen et al., 2012).
Undoubtedly, the issue of determining a person’s emotional state has not only serious theoretical but also obvious applied significance. Emotions of different modalities and degrees of intensity underlie the motive of any human activity, but assessing the emotional state of a linguistic personality is especially important for areas of life that have a subject-object nature, where the main object and, simultaneously, the subject is a person – for example, medicine, pedagogy, psychology, criminology, marketing, etc. Determining the emotional state can be used for various purposes in long-term communication: in medicine – for diagnosing and treating mental disorders at early stages; in psychology – for studying emotional reactions to certain stimuli and situations; in pedagogy – for analyzing the emotional state of schoolchildren and students and optimizing the educational process. It is also important to study emotions in “one-time” communication: in criminology, for example, to identify false testimony and deception; in marketing and advertising, to determine the consumer’s reaction to a product or advertising campaign (Balabanova et al., 2023; Dellaert et al., 1996).
In all the above-mentioned cases there is a serious need to speed up and automate the emotion recognition process, and the use of various machine learning methods and image recognition algorithms can make this process more accurate and efficient. The purpose of this work is to develop a neural network that makes it possible to recognize human emotions from a speech signal. It is worth noting that while there is an abundance of research in this area abroad, in Russia it is limited to a small number of articles, and when Russian researchers do work on recognizing emotions from a speech signal, English-language datasets are usually used. This article sets the task of training a neural network to recognize emotions from both Russian and English speech, which makes it possible to study a wider scope of application of the proposed solution in comparison with monolingual systems.
1. On the model of human emotional state
Speech emotion recognition (SER) is the determination of a person’s emotional state based on a speech signal without taking into account the semantic content. A person in the process of communication solves this problem quite effectively. However, at present, automatic classification of the speaker’s emotional state based on a speech signal is still relevant in various studies (Dvoynikova, Karpov, 2020; Fedotov et al., 2018; Shakhovsky, 2009).
One of the key points in creating SER is the choice of a model of a person’s emotional state. Currently, psychologists have developed many classifications of human emotions (Gorshkov, Dorofeev 2003; Grimm et al., 2007; Maysak, 2010).
Many scientists believe that the diversity of human emotions can be adequately represented by the model of emotions developed by James Russell (Russell et al., 2005).
James Russell developed a model based on subjective feelings. He used a statistical method to group emotion ratings based on positive correlations – essentially grouping similar words about emotions in a circle. This multidimensional scaling analysis revealed two bipolar dimensions – valence and arousal (Russell et al., 2005).
Thus, any emotion can be described by using the unpleasantness/pleasantness dimension (valence) and the high/low arousal dimension. One of the variations of Russell’s model is shown in Figure 1 (Russell et al., 2005).
Figure 1. J. Russell’s model of the human emotional state
Russell’s (1980) model proposes that valence and arousal are independent bipolar dimensions. Independence means that valence and arousal are uncorrelated. Bipolarity means that opposite emotion terms represent each of the opposite poles of valence and arousal. For example, in Figure 1 above, “happy” and “sad” are shown at the opposite poles of the valence dimension. Similarly, Figure 1 shows that “excited” and “bored” are shown at the opposite ends of the arousal dimension (Russell et al., 2005).
Thus, a person cannot be excited and bored at the same time. Finally, according to this model, mixed emotions are similar in their subjective experience.
Therefore, a mixed emotion cannot consist of feelings that differ greatly in valence or arousal, such as happiness and sadness. In Figure 1, mixed emotional experiences are emotions that are located next to each other in the same quadrant.
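As an illustration of how this two-dimensional representation can be used computationally, the following sketch places a few emotion labels as points in valence-arousal space and maps them to quadrants of the circumplex; the coordinates are rough, hand-picked assumptions for demonstration and are not values taken from Russell's study.

```python
# Illustrative sketch: emotions as points in Russell's valence-arousal space.
# Coordinates are rough, hand-picked assumptions for demonstration only.
EMOTIONS = {
    "happy":   ( 0.8,  0.5),   # pleasant, moderately aroused
    "excited": ( 0.6,  0.9),   # pleasant, highly aroused
    "sad":     (-0.7, -0.4),   # unpleasant, low arousal
    "bored":   (-0.4, -0.8),   # unpleasant, very low arousal
    "angry":   (-0.8,  0.7),   # unpleasant, highly aroused
    "calm":    ( 0.5, -0.6),   # pleasant, low arousal
}

def quadrant(valence: float, arousal: float) -> str:
    """Return the quadrant of the circumplex model for a (valence, arousal) point."""
    if valence >= 0 and arousal >= 0:
        return "pleasant / high arousal"
    if valence < 0 and arousal >= 0:
        return "unpleasant / high arousal"
    if valence < 0:
        return "unpleasant / low arousal"
    return "pleasant / low arousal"

for name, (v, a) in EMOTIONS.items():
    print(f"{name:8s} -> {quadrant(v, a)}")
```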
2. On the algorithm for recognizing human emotions
The generalized algorithm for recognizing human emotions from a speech signal, shown in Figure 2, includes the following stages (a schematic code sketch is given after the list):
• Pre-processing of the audio signal: before starting to analyze the speech signal, it is necessary to pre-process the data, such as noise filtering, volume normalization and feature extraction from the audio file, etc.
• Feature extraction: to analyze a person’s emotional state from speech, various features are used, such as voice frequency, intonation, speech rate, phrase duration and others. These features can be extracted using special signal processing algorithms.
• Machine learning: at this stage of emotion recognition, machine learning methods are used to build a classifier.
• Emotion assessment: having a training model, it is possible to recognize the speaker’s emotional state from a speech signal based on feature analysis.
• Interpretation of results: it involves analyzing the results and using them for a specific task. For example, in control systems or improving human-machine interaction.
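The stages listed above can be expressed as a minimal processing pipeline. The sketch below is a schematic outline only: the helper functions (preprocessing, feature extraction, classification) are placeholders for whatever concrete algorithms a particular SER system uses, and the MFCC-mean feature vector is an assumption chosen for brevity.

```python
# Schematic SER pipeline following the stages above (placeholder implementations).
import numpy as np
import librosa

def preprocess(path: str, sr: int = 16000) -> np.ndarray:
    """Stage 1: load audio, resample and normalize volume (noise filtering omitted)."""
    signal, _ = librosa.load(path, sr=sr)
    return signal / (np.max(np.abs(signal)) + 1e-9)

def extract_features(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Stage 2: a simple feature vector (MFCC means); real systems use richer sets."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

def classify(features: np.ndarray, model) -> str:
    """Stages 3-4: apply a trained classifier to the feature vector."""
    return model.predict(features[None, :])[0]

# Stage 5 (interpretation) depends on the application, e.g. flagging an angry caller.
```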
Figure 2. Generalized algorithm for recognizing emotions from speech
In general, speech emotion recognition systems are analyzed from the point of view of pattern recognition in three areas (Lemaev, Lukashevich, 2024):
1) selection of an emotional speech database,
2) extraction of effective features,
3) development of reliable classifiers using machine learning algorithms.
However, it should be noted that data preprocessing algorithms can also significantly increase the quality of SER work.
The quality of solving the problem of emotion recognition by speech largely depends on the correctness of the approach to solving each of the above stages. However, the solution to each of the above subtasks is associated with a number of problems. Schematically, the problems of each stage and the ways to solve them are shown in Figure 3 (Fedotov et al., 2018; Dvoynikova, Karpov, 2020; Velichko et al., 2022).
Figure 3. Problems of SER construction
When choosing a dataset for training an SER system, a number of factors must be taken into account. First of all, this is the quality of the emotional speech data. It should also be noted that an SER system trained on a monolingual dataset will not provide high-quality recognition of speaker emotions in another language. Another aspect to be considered is the choice of speech parameters when constructing the system: this choice is directly related to the choice of the classification method and vice versa, and both are, in turn, directly related to the amount of data in the dataset used.
Figure 4 schematically shows the strategy for choosing an approach to constructing SER depending on the availability of representative data for training and knowledge of the subject area (Fedotov et al., 2018; Dvoynikova, Karpov, 2020; Velichko et al., 2022).
Figure 4. Basic SER development strategies
When building systems for recognizing paralinguistic phenomena in general, and a person’s emotional state from speech in particular, the quantity and quality of the data available for training the model play a major role. With an emotional speech dataset containing a relatively small amount of data, the use of neural networks is impractical, since high-quality training of a neural network requires much more data than training models built with classical classifiers. On the other hand, with a large amount of training data the use of classical methods becomes impractical because of their high computational cost. Another factor is the quality and quantity of the speech signal features used to recognize emotions. Using a large number of features, especially if some of them carry no significant information for emotion recognition, is impractical with classical classification methods, since irrelevant information reduces the quality of the classifier. Conversely, when features are available that clearly reflect the manifestations of emotions in speech, the use of classical classification methods is recommended. The neural network approach has proven effective in paralinguistic speech analysis, since it performs well on problems for which an exact solution is difficult to find. Thus, given a large volume of emotional speech, deep neural networks can be used both to search for useful representations of emotional speech features and to recognize emotions directly. It should also be noted that paralinguistic speech analysis in general, and emotion recognition from the speech signal in particular, is a relatively new area of research both from the point of view of applied linguistics and from the point of view of speech signal processing, in Russian as well as in world science. It is therefore advisable to conduct research on recognizing human emotions from oral speech in various directions (Albornoz et al., 2011; Ayadi et al., 2011; New et al., 2003; Fedotov et al., 2018; Dvoynikova, Karpov, 2020; Velichko et al., 2022).
3. On emotion recognition systems
Figure 5 provides a comprehensive overview of emotion recognition systems based on speech (Dvoynikova, Karpov, 2020; Balabanova et al., 2023; Abramov et al., 2024).
Figure 5. Overview of speech emotion recognition systems
Thus, when creating a system that recognizes emotions from a speech signal, three main aspects must be considered: the choice of the emotional speech dataset, the choice of speech signal features, and the choice of a classification method.
I. Selecting a dataset of emotional speech.
Despite intensive research in corpus linguistics over the last decade, only a few speech datasets include emotional speech (Cowie et al., 2001; Sadiq et al., 2021; Sahoo, Routray, 2016; Nogueiras et al., 2001), and most of the existing emotional speech datasets are not publicly available. Therefore, researchers have to turn to proprietary datasets for prosody and emotion recognition studies.
In particular, very few studies on emotional intonation are devoted to the Russian language (Holden, Hogan, 1993; Hozjan, Kačič 2003; Siging, 2009).
However, it appears that the expression of emotions in Russian has both universal and language-specific features. Therefore, it poses a challenge for the theory of emotion recognition from speech (Makarova, 2000; Fedotov et al., 2018; Dvoynikova, Karpov, 2020). Thus, it seems appropriate to consider Russian-language corpora of emotional speech. Currently, there are three main corpora of Russian emotional speech: RUSLANA, RAMAS, and Dusha.
1) RUSLANA (RUSsian LANguage Affective speech) (Zeiler, Fergus, 2013).
A dataset of affective (emotional) utterances for the Russian language. The RUSLANA dataset contains recordings of 61 speakers (12 men and 49 women) who pronounce ten sentences neutrally (non-emotionally) and express the following five emotional states: surprise, happiness, anger, sadness, and fear. The average age of the speakers is 18.7 years, with a range from 16 to 28 years.
2) RAMAS (Russian Acted Multimodal Affective Set).
RAMAS is the first multimodal corpus in Russian. This dataset contains about 7 hours of high-quality video recordings of the subjects’ faces and speech (Perepelkina et al., 2018).
The dataset was created by engaging 10 semi-professional actors (5 men and 5 women) in acting out interactive dyadic scenarios. Each scenario included one of the basic emotions: anger, sadness, disgust, happiness, fear, or surprise, as well as some characteristics of social interaction, such as dominance and submission (Perepelkina et al., 2018).
In order to note the emotions that the subjects actually experienced during the process, the creators of the dataset asked them to fill out short questionnaires (self-reports) after each scenario. The recordings were labeled by 21 annotators (at least five annotators labeled each scenario) (Perepelkina et al., 2018).
RAMAS is an open dataset that provides the scientific community with multimodal data on the relationship between faces, speech, gestures, and physiology. In this paper, the focus is on the speech data recording and its labeling. RAMAS contains recordings of basic emotions: anger, sadness, disgust, happiness, fear and surprise (Perepelkina et al., 2018).
3) Dusha (Lemaev, Lukashevich, 2024).
Dusha is a bimodal corpus suitable for speech emotion recognition (SER) tasks.
This dataset was created by SberDevices and is currently the largest dataset in Russian designed to solve problems of recognizing emotions in spoken language.
The dataset is divided into two parts. For the first part, called Crowd, the authors generated texts based on conversations of real people with a virtual assistant, which were then voiced via crowdsourcing: the speakers were given a text and the emotion with which this text should be pronounced, and the resulting audio recordings were additionally checked by a second group. The second part, called Podcast, contains short (up to 5 words) excerpts from Russian-language podcasts, which were then classified by emotion.
In total, 5 classes of emotions are presented: anger, sadness, positive, neutral, and others.
This emotional speech dataset is the focal point of the present research, as it contains both emotions acted out by speakers and emotions obtained in a natural environment.
II. Speech Signal Feature Selection for Emotion Detection (Kerkeni et al., 2020; Kim et al., 2017)
In fact, emotional feature extraction is the core problem in the SER system. Many researchers (Surabhi, Saurabh, 2016; Neiberg et al., 2006) have proposed important speech features that contain emotion information, such as energy, pitch, formant frequency, cepstral coefficients (LPCC and MFCC), and spectral features (Vu et al., 2021). Therefore, most researchers prefer to use a combined feature set consisting of a number of features containing more emotional information (Wu et al., 2011; Hsu et al. 2021). However, using a combined feature set may lead to high dimensionality and redundancy of speech features which complicates the training process for most machine learning algorithms and increases the likelihood of overfitting.
Thus, feature selection is necessary to reduce the redundancy of feature sizes.
Both feature extraction and feature selection can improve training performance, reduce computational complexity, build more generalizable models, as well as reduce the amount of memory required for storage.
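As a hedged illustration of a combined feature set and a simple selection step, the sketch below extracts several commonly used acoustic features (energy, a pitch track, MFCC statistics) and then keeps the most discriminative ones with a generic statistical filter. The concrete features and the ANOVA-based selector are assumptions made for demonstration, not the feature set used in this study.

```python
# Sketch: combined acoustic feature set plus simple feature selection (illustrative only).
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif

def combined_features(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    energy = float(np.mean(signal ** 2))                      # average signal energy
    f0 = librosa.yin(signal, fmin=60, fmax=400, sr=sr)        # pitch track (YIN estimator)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # cepstral coefficients
    return np.hstack([energy, np.nanmean(f0), mfcc.mean(axis=1), mfcc.std(axis=1)])

# X: feature matrix (n_samples x n_features), y: emotion labels.
# The selector keeps the k features with the strongest class separation (ANOVA F-test).
def select_features(X: np.ndarray, y: np.ndarray, k: int = 10) -> np.ndarray:
    return SelectKBest(f_classif, k=k).fit_transform(X, y)
```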
III. Methods for recognizing emotions in a speech signal.
The key aspect of emotion recognition in speech is the selection of a classification method. Currently, many methods have been proposed, which are used both individually and in various combinations. New solutions for combining methods and neural network architectures constantly appear in open sources. However, as a first approximation, they are divided in the literature into two types: classical methods and neural network methods (Uzdyaev, 2020; Wang et al., 2020).
4. Development of a neural network for emotion recognition
As noted above, the selection of an emotional speech base and its preprocessing is an important stage in building the SER system.
In this paper, the emotional speech dataset Dusha was chosen for our research. This choice was due to several factors:
1. Since the SER system being developed is intended for Russian-speaking users, the choice of the Dusha speech dataset is obvious, as its language is Russian.
2. This emotional speech dataset contains a significantly larger amount of data compared to other Russian-language emotional speech datasets.
3. The Dusha emotional speech dataset contains two types of records: those obtained in the laboratory (actors playing out emotions) and those obtained from real emotional dialogues.
4. Dusha is a free, open emotional speech dataset.
Preprocessing of speech data for building SER.
A random sample of 201,850 audio files was downloaded from the Dusha dataset (Lemaev, Lukashevich, 2024). During the sample analysis, it was found that preprocessing was needed. The purpose of preprocessing was to obtain high-quality data for training the model, taking into account the available computing resources. Thus, it was necessary to reduce the number of audio recordings while keeping those that are most indicative of the recognized emotions.
After preprocessing, the data set contained audio recordings of emotional speech that met the following criteria:
1. Agreement between the expert labels and the emotion the speaker was asked to portray.
Part of the Dusha dataset was obtained in the laboratory, that is, recorded by actors and subsequently labeled by four experts, each of whom determined the emotion contained in the audio signal. Only those audio recordings were used in which the labels of all four experts matched the emotion the actor had been asked to portray. If at least one discrepancy was detected, the audio file was removed.
2. Presence of a valid emotion label.
Analyzing the original dataset, it was discovered that some of the audio recordings carried the label “Other”, meaning that they did not relate to any of the emotions under consideration (neutral, anger, joy, sadness). These audio recordings were excluded from the resulting dataset.
As a result, the dataset size after preprocessing was 29,698 audio files. The training and test sets were divided the following way: 27,256 audio files in the training dataset and 2,442 audio files in the test dataset.
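The criteria above can be applied programmatically. The sketch below assumes a hypothetical metadata table with one column per expert label, a column for the emotion the speaker was asked to portray, and a final label column; these column names are invented for illustration and do not necessarily match the actual Dusha metadata format.

```python
# Hypothetical filtering of Dusha metadata according to the criteria above.
# Column names (expert_1..expert_4, ordered_emotion, label) are assumptions.
import pandas as pd

EMOTIONS = {"neutral", "angry", "positive", "sad"}

def filter_metadata(df: pd.DataFrame) -> pd.DataFrame:
    # Criterion 2: drop recordings labeled "Other".
    df = df[df["label"].isin(EMOTIONS)]
    # Criterion 1: keep only recordings where all four expert labels
    # coincide with the emotion the speaker was asked to portray.
    experts = ["expert_1", "expert_2", "expert_3", "expert_4"]
    agree = df[experts].eq(df["ordered_emotion"], axis=0).all(axis=1)
    return df[agree]
```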
At the next stage of preprocessing, a balanced dataset was obtained. Balancing was carried out by including the same number of audio files for each of the emotions under consideration. Thus, the training dataset included 6,814 audio recordings per emotion (neutral, anger, joy, sadness). In this form, the resulting dataset was used to solve the emotion recognition problem with neural network methods. The next preprocessing step consisted of aligning the duration of all audio files to 3 seconds. This value was chosen with regard to the available computing resources and is long enough to express an emotion. The reduction to 3 seconds was carried out as follows (a sketch of this step is given after the list):
- from audio recordings lasting more than 3 seconds, a 3-second fragment from the middle of the signal was selected, since, on average, this is where the emotion is expressed most vividly;
- audio signals lasting less than 3 seconds were padded with zeros up to the required duration. It should be noted that during training an attempt was made to increase recognition accuracy by padding shorter recordings not with zeros but with a repeated fragment of the same recording, since the semantic component of speech is not used in the developed emotion recognition algorithm. However, this did not increase the accuracy of emotion recognition and was rejected in further studies, since zero-padding carries a smaller computational load than repeating informative fragments.
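A minimal sketch of this duration-alignment step is given below: recordings longer than 3 seconds are cropped to a central 3-second fragment, shorter ones are zero-padded. The 16 kHz sampling rate is an assumption inferred from the 48,000 samples per 3-second recording reported in the mel-spectrogram description later in the paper.

```python
# Align every recording to exactly 3 seconds (crop from the middle or zero-pad).
import numpy as np

def align_duration(signal: np.ndarray, sr: int = 16000, seconds: float = 3.0) -> np.ndarray:
    target = int(sr * seconds)
    if len(signal) > target:
        # take the central 3-second fragment, where the emotion is usually most pronounced
        start = (len(signal) - target) // 2
        return signal[start:start + target]
    # pad shorter recordings with zeros up to the required duration
    return np.pad(signal, (0, target - len(signal)))
```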
Features of a speech signal in recognizing emotions using neural network methods.
In order to recognize a person’s emotional state from oral speech using neural network methods, mel-spectrograms were used, which represent the energy characteristics of a speech signal over time. The mel scale was chosen because it most adequately reflects the psychophysical perception of sound by a person (Abramov et al., 2024; Balabanova, Abramov, 2023).
The parameters used to construct a mel-spectrogram are presented in Table 1.
Table 1. Parameters for constructing a mel-spectrogram
The illustration of a spectrogram construction is shown in Figures 6 and 7.
Figure 6. Construction of a mel-spectrogram based on a speech signal
Figure 7. Spectrogram of a speech signal fragment
As a result, a mel-spectrogram was obtained for each speech signal in the dataset; its size along the time axis was 188 (the number of 1024-sample windows over 48,000 samples with a shift of 256 samples) by 80 (the number of mel filters used).
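For reference, the sketch below reproduces such a mel-spectrogram computation with the stated parameters (1024-sample window, 256-sample hop, 80 mel filters); the 16 kHz sampling rate is an assumption inferred from the 48,000 samples per 3-second recording. With these settings librosa returns a matrix of shape (80, 188) for a 3-second signal.

```python
# Mel-spectrogram with the parameters used in this work (sampling rate assumed to be 16 kHz).
import numpy as np
import librosa

def mel_spectrogram(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr,
        n_fft=1024,        # analysis window of 1024 samples
        hop_length=256,    # window shift of 256 samples
        n_mels=80,         # 80 mel filters
    )
    return librosa.power_to_db(mel)  # the log-scaled spectrogram is usually fed to the network

x = np.random.randn(48000).astype(np.float32)   # a 3-second signal at 16 kHz
print(mel_spectrogram(x).shape)                  # (80, 188)
```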
Developing and testing the neural network.
The convolutional architecture was chosen as the basic architecture of the neural network developed for recognizing a person’s emotional state from speech. This choice was due to several factors. Firstly, convolutional neural networks require fewer resources for training and operation in general, which is important when computing resources are limited. Secondly, a spectrogram of the speech signal is planned to be used as the input data, and convolutional neural networks were historically created to work with images and show excellent results in image classification problems (Raudys, 2003).
At the next stage of the experiment, it was decided to build a neural network based on the idea of VGG (Visual Geometry Group). This architecture was proposed by K. Simonyan and A. Zisserman from Oxford University in the article “Very Deep Convolutional Networks for Large-Scale Image Recognition”.
The main feature of the VGG idea is the block structure of the convolutional neural network: each block of the network contains several consecutive convolutional layers followed by one pooling layer.
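The exact configuration of the network is given in Figure 8 and Table 2; the sketch below illustrates only the general VGG-style block idea (several convolutional layers followed by one pooling layer) applied to an 80×188 mel-spectrogram and four emotion classes. The number of blocks and the channel widths are illustrative assumptions, not the exact architecture of the paper.

```python
# Illustrative VGG-style convolutional network for 80x188 mel-spectrogram input (PyTorch).
# Block count and channel widths are assumptions, not the exact architecture of the paper.
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, n_convs: int = 2) -> nn.Sequential:
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))        # one pooling layer closes the block
    return nn.Sequential(*layers)

class EmotionCNN(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(1, 16), vgg_block(16, 32), vgg_block(32, 64))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 80, 188) -- a one-channel mel-spectrogram
        return self.classifier(self.features(x))

model = EmotionCNN()
print(model(torch.randn(2, 1, 80, 188)).shape)   # torch.Size([2, 4])
```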
The network architecture is shown in Figure 8.
Figure 8. Convolutional neural network
According to the reference sources, neural networks based on the VGG idea have proven themselves well in image classification problems. Since the input data are spectrograms, this architecture was chosen.
The main elements and characteristics of the presented neural network are shown in Table 2.
Table 2. Main elements and characteristics of the neural network
The metrics used to evaluate the quality of the presented neural network were precision, recall and F1 score. The recall metric shows whether the algorithm can detect a given class at all, which is why it was chosen. The precision metric shows the proportion of objects marked as positive by the neural network that are actually positive. The F1 score provides a balance between precision and recall. These metrics are usually used together to achieve a better evaluation of the model during training. Precision, recall and F1 score for each class were calculated using the following expressions:
Precision = TP / (TP + FP)   (1)
Recall = TP / (TP + FN)   (2)
F1 = 2 · Precision · Recall / (Precision + Recall)   (3)
where TP is the proportion of positive objects correctly predicted as positive, FN is the proportion of positive objects incorrectly predicted as negative (type II error, false rejection), and FP is the proportion of negative objects incorrectly predicted as positive (type I error, false acceptance).
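In practice these per-class metrics can be obtained directly from the true and predicted labels, for example with scikit-learn; the sketch below is a generic illustration with made-up example labels and is not tied to the specific outputs of this study.

```python
# Per-class precision, recall and F1 score from true and predicted emotion labels.
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

y_true = ["angry", "sad", "neutral", "positive", "sad", "angry"]      # example labels
y_pred = ["angry", "neutral", "neutral", "positive", "sad", "sad"]

labels = ["neutral", "angry", "positive", "sad"]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

for lab, p, r, f in zip(labels, precision, recall, f1):
    print(f"{lab:8s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")

print(confusion_matrix(y_true, y_pred, labels=labels))   # rows: true, columns: predicted
```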
In order to compare the quality of emotion recognition in speech by the neural network with the same task performed by a human, confusion matrices of recognition accuracy by a human and by the neural network were constructed based on the Dusha emotional speech dataset (Figure 9).
Figure 9. Confusion matrix for recognizing emotions from speech by a) a human, b) a neural network
The results of precision and recall metrics for emotion recognition in speech signal by a human and neural network are shown in Table 3.
Table 3. Comparison of the quality of emotion recognition by a human and a neural network
Analyzing the precision and recall metrics, as well as the error matrices, we can conclude that the developed neural network recognizes the considered emotions almost equally. The average precision and recall metrics indicate a higher quality of emotion recognition from oral speech by the developed neural network in comparison with human results.
Thus, the developed neural network for recognizing emotions from oral speech can be used in various areas of human activity: in security systems, in systems for analyzing conversations with clients in call centers, in smart home systems, in analyzing human states in production, in diagnosing the initial stages of depression and other diseases, etc.
However, when considering the problem of recognizing emotions from oral speech in the context of interlingual communication, neural networks may run into difficulties, since a model trained on a Russian-language dataset will produce low-quality emotion recognition for another language. To investigate this issue, an English-language dataset (RAVDESS) containing emotional utterances of the four classes under consideration was used. RAVDESS contains 1,440 files recorded by 24 professional actors (12 women and 12 men). Speech emotions include expressions of calmness, joy, sadness, anger, fear, surprise, and disgust. Each expression has two levels of emotional intensity (normal, strong) and an additional neutral expression (Hozjan, Kačič, 2003; Vu et al., 2021).
Figure 10 and Table 4 show the results of emotion recognition from English-language speech by a neural network trained on a Russian-language dataset.
Figure 10. Confusion matrix for recognizing emotions in English speech
Table 4. Quality of recognizing emotions in English speech when training a neural network on a Russian-language dataset
The results of the experiment show that the use of a neural network trained on a monolingual dataset in interlingual communication is impractical, since the result of classifying an utterance into a particular class does not exceed the chance level.
However, after the proposed model was trained on the English-language dataset, the emotion recognition results obtained were close to those obtained on the Russian-language dataset.
The total sample of the English-language dataset comprised 7,472 utterances, of which 5,977 were used as the training sample and 1,495 as the test sample. Preprocessing was carried out in the same way as for the Russian-language dataset.
The results are presented in the confusion matrix in Figure 11 and in Table 5.
Figure 11. Confusion matrix for recognizing emotions in English speech
Table 5. Quality of emotion recognition in English speech when training a neural network on an English-language dataset
Thus, when using algorithms based on the neural network approach to solve paralinguistic problems in general and to recognize emotions from oral speech in particular, the issue of interlingual communication should be considered.
At the first stage, one possible solution in interlingual communication is to use a neural network that recognizes the speaker’s language (e.g., from Meta), and then apply a version of the proposed neural network trained for the required language.
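Such a two-stage scheme could look roughly as follows; both the language-identification step and the per-language models are hypothetical placeholders here, since the choice of a specific language-identification network lies outside the scope of this work.

```python
# Hypothetical two-stage pipeline: language identification first, then a language-specific SER model.
# identify_language() and the per-language models are placeholders, not real APIs.
def recognize_emotion(signal, models: dict, identify_language) -> str:
    lang = identify_language(signal)          # e.g. "ru" or "en"
    if lang not in models:
        raise ValueError(f"No emotion model trained for language: {lang}")
    return models[lang].predict(signal)       # SER model trained on that language
```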
Conclusion
The article presents a neural network of convolutional architecture that makes it possible to recognize the emotional state of a speaker directly from the speech signal, without taking into account the semantic component, and distinguishes four human emotions (sadness, joy, anger, neutral). Particular attention is paid to the formation of a dataset for training and testing the model, since the quality of the developed SER system directly depends on this stage and since there are currently practically no open speech datasets for studying paralinguistic phenomena (especially in Russian). Using mel-spectrograms of the speech signal as features for emotion recognition made it possible to increase the recognition accuracy and the processing speed of the neural network compared to low-level descriptors. The results of experiments on the test dataset indicate that the proposed neural network successfully copes with the task of recognizing the four emotions, demonstrating high performance (about 75%) in terms of precision, recall and F1 score. It is also shown that the developed neural network can be used to recognize speaker emotions in interlingual communication.
However, it should be noted that while the proposed solution identifies only four emotions (neutral, sadness, anger, positive), other emotions may be present in a person's speech at the same time. Another obstacle to the correct recognition of emotion from speech may be a situation in which the emotion manifests itself very briefly and lasts no more than a second. In addition, the use of the presented SER system is limited to two languages: Russian and English. Another factor that may negatively affect the operation of the presented system is the presence of a noise component in the speech signal, for example white Gaussian noise. These aspects are the subject of further research.