Using CNN and LSTM neural networks for Arkhangelsk dialect word identification and classification
The study of dialects offers insight into the culture and history of a people as they are reflected in language. Dialectal vocabulary differs from the standard language in meaning and pronunciation, as well as in word formation and grammatical structure, primarily morphology. The Arkhangelsk dialects show the same patterns. The aim of this paper is to develop a dialect word classifier that identifies dialect words in a given text and assigns each of them to one of several predefined groups. The novelty of the research lies in the fact that no automated system for classifying dialect words based on Arkhangelsk dialect materials currently exists. The article describes the development of a neural network for dialect word identification and classification. The dialect words were drawn from dialect texts gathered during dialectological research conducted from the 1960s to the present day. LSTM (long short-term memory) and CNN (convolutional neural network) architectures are compared. One of the main challenges in dialect word classification is that the neural network must be trained on a limited amount of data. To mitigate this limitation, we investigate a bigram-based word encoding in addition to the unigram-based one. The trained model that showed the best results was integrated into our application for processing and analyzing dialect words. A confusion matrix constructed for the best model shows the highest performance for the derivational class and the lowest for the lexical class.
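The combined unigram and bigram encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes that unigrams and bigrams are taken at the character level, which the abstract does not specify, and that each word becomes a fixed-length sequence of integer ids suitable as input to a CNN or LSTM. The function names, the reserved ids, and the toy strings are hypothetical and are not taken from the authors' implementation or data.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # reserved ids (assumption): padding and unseen n-grams


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_vocab(words: List[str], n: int) -> Dict[str, int]:
    """Assign an integer id (starting at 2) to every n-gram seen in the training words."""
    vocab: Dict[str, int] = {}
    for word in words:
        for gram in char_ngrams(word, n):
            if gram not in vocab:
                vocab[gram] = len(vocab) + 2
    return vocab


def encode(word: str, vocab: Dict[str, int], n: int, max_len: int) -> List[int]:
    """Map a word to n-gram ids, truncated or zero-padded to max_len."""
    ids = [vocab.get(gram, UNK) for gram in char_ngrams(word, n)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def encode_word(word: str, uni_vocab: Dict[str, int],
                bi_vocab: Dict[str, int], max_len: int) -> List[int]:
    """Concatenate the unigram encoding and the bigram encoding of a word."""
    return encode(word, uni_vocab, 1, max_len) + encode(word, bi_vocab, 2, max_len)


# Toy strings standing in for training words (not real dialect data).
train_words = ["abab", "abc"]
uni_vocab = build_vocab(train_words, 1)  # {'a': 2, 'b': 3, 'c': 4}
bi_vocab = build_vocab(train_words, 2)   # {'ab': 2, 'ba': 3, 'bc': 4}

features = encode_word("abd", uni_vocab, bi_vocab, max_len=4)
print(features)  # [2, 3, 1, 0, 2, 1, 0, 0]
```

In this sketch the bigram half gives the network information about adjacent character pairs without requiring more training words, which is one plausible reason such an encoding could help when data are limited.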
Shurykina, L. S., Latukhina, E. A., Petrova, T. V. (2024). Using CNN and LSTM Neural Networks for Arkhangelsk Dialect Word Identification and Classification, Research Result. Theoretical and Applied Linguistics, 10 (4), 106–125.
The research was supported by the Russian Science Foundation, project No. 23-28-01380, "Thematic dictionary of Arkhangelsk dialects with electronic support" (https://rscf.ru/project/23-28-01380/).