2313-8912

Research Result. Theoretical and Applied Linguistics

10.18413/2313-8912-2024-10-4-0-2

3673

Large language models and prompt engineering in linguistic research

<strong>Emotion recognition from spoken speech using a neural network approach</strong>

<strong>Using neural network technologies in determining the emotional state of a person in oral communication</strong>

Балабанова

Татьяна Николаевна

Balabanova

Tatyana N.

Sozonova@bsu.edu.ru

Гайворонская

Диана Игоревна

Gaivoronskaya

Diana I.

trubitsyna@bsu.edu.ru

Доборович

Анна Николаевна

Doborovich

Anna N.

doborovich@bsu.edu.ru

Belgorod State National Research University, Russia

2024

10400

Human speech always carries emotional coloring, probably because our emotions and mood affect the way we speak. When we are happy, worried, sad or angry, this shows in our voice, tempo and intonation. It is impossible to speak without emotion, since emotions are an integral part of our personality and accompany us everywhere. Spoken language becomes richer and more expressive when we convey our emotions and feelings through words. Assessing a person’s emotional state from speech can benefit many areas of life, such as medicine, psychology, criminology, marketing and education. In medicine, speech-based emotion assessment can help in diagnosing and treating mental disorders, in monitoring a patient’s emotional state, and in detecting conditions such as Alzheimer’s disease at an early stage. In psychology, the method can be useful for studying emotional reactions to various stimuli and situations. In criminology, speech analysis and emotion detection can be used to identify false testimony and deception. In marketing and advertising, it can help to understand how an audience reacts to a product or advertising campaign. In education, it can be used to analyze students’ emotional state and optimize the educational process. Automating emotion recognition is therefore a promising research direction, and applying machine learning methods and pattern recognition algorithms can make the process more accurate and efficient. As a tool for recognizing paralinguistic phenomena, namely emotions, in human speech, a neural network approach is proposed, which has proved effective for problems where an exact solution is hard to find.
The paper presents a neural network with a convolutional architecture that recognizes four human emotions (sadness, joy, anger, neutral) from speech. Particular attention is paid to building the dataset for training and testing the model, since there are currently almost no open speech databases for studying paralinguistic phenomena, especially in Russian. This study uses the Dusha emotional speech database. Mel-spectrograms of the speech signal serve as features for emotion recognition, which increased both the recognition rate and the speed of the neural network compared with low-level descriptors. Experiments on the test set showed that the presented neural network recognizes human emotions from speech in 75% of cases, which the authors consider a high result. Future work involves training and, if necessary, modifying the presented network to recognize paralinguistic phenomena not covered in this study, such as lying, fatigue and depression.

Human speech usually has an emotional connotation, because emotions and mood influence the physiology of the vocal tract and, as a result, speech itself. When a person is happy, worried, sad or angry, this is reflected in various characteristics of the voice, the pace of speech and its intonation. Assessing a person’s emotional state from speech can benefit many areas of life, including medicine, psychology, criminology, marketing and education. In medicine, speech-based emotion assessment can support the diagnosis and treatment of mental disorders, the monitoring of a patient’s emotional state, the early detection of conditions such as Alzheimer’s disease, and the diagnosis of autism. In psychology, the method can be useful for studying emotional reactions to various stimuli and situations. In criminology, speech analysis and emotion detection can be used to detect false statements and deception. In marketing and advertising, it can help to understand consumer reactions to a product or advertising campaign. In education, it can be used to analyze the emotional state of students and optimize the educational process. Automating emotion recognition is therefore a promising area of research, and machine learning methods and pattern recognition algorithms can make the process more accurate and efficient. To identify paralinguistic expressions of emotion in human speech, the authors propose a neural network approach, which has proved effective for complex problems where an exact solution is hard to find. The work presents a convolutional neural network that recognizes four human emotions (sadness, joy, anger, neutral) from spoken speech.
Particular attention is paid to building the dataset for training and testing the model, since at present there are practically no open speech databases for the study of paralinguistic phenomena, especially in Russian. This study uses the Dusha emotional speech database. Mel-spectrograms of the speech signal are used as features for recognizing emotions, which increased both the recognition rate and the speed of the neural network compared with low-level descriptors. Experiments on the test set showed that the presented neural network recognizes human emotions from speech in 75% of cases, which the authors consider a high result. Further research involves training and, if necessary, modifying the presented neural network to recognize paralinguistic phenomena not covered in this study, for example lying, fatigue and depression.
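
As a rough illustration of the mel-spectrogram features mentioned in the abstract, the pure-Python sketch below computes a log-mel spectrogram for a synthetic tone. It is not the authors' pipeline: the frame length, hop size, number of mel bands, the HTK-style mel formula, the naive DFT, and all function names are assumptions chosen only to keep the example self-contained.

```python
import cmath
import math


def hz_to_mel(hz):
    """HTK-style mel scale (one common convention; others exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies: each filter uses three consecutive edges.
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        low, centre, high = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if low < f <= centre:
                row.append((f - low) / (centre - low))    # rising slope
            elif centre < f < high:
                row.append((high - f) / (high - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Naive O(n^2) DFT of a Hann-windowed frame; slow but dependency-free."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    """Return a list of frames, each a list of n_mels log-energies in dB."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spectrum = power_spectrum(signal[start:start + n_fft])
        mel_energy = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
        # Small constant avoids log(0) for empty bands.
        frames.append([10.0 * math.log10(e + 1e-10) for e in mel_energy])
    return frames


# Usage: a 440 Hz tone sampled at 8 kHz stands in for a speech recording.
sr = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * t / sr) for t in range(2048)]
spec = log_mel_spectrogram(tone, sr)
print(len(spec), "frames x", len(spec[0]), "mel bands")
```

Each inner list is one time frame, so stacking the frames gives the two-dimensional time-by-frequency array that a convolutional network would typically take as input. In practice an FFT-based library routine would replace the naive DFT used here.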

Speech data; Speech databases; Neural networks; Convolutional neural networks; Emotion recognition; Classification; Classification methods
