<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article article-type="research-article" dtd-version="1.2" xml:lang="ru" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><front><journal-meta><journal-id journal-id-type="issn">2313-8912</journal-id><journal-title-group><journal-title>Research Result. Theoretical and Applied Linguistics</journal-title></journal-title-group><issn pub-type="epub">2313-8912</issn></journal-meta><article-meta><article-id pub-id-type="doi">10.18413/2313-8912-2024-10-4-0-6</article-id><article-id pub-id-type="publisher-id">3677</article-id><article-categories><subj-group subj-group-type="heading"><subject>Large Language Models and Prompt Engineering in Linguistics</subject></subj-group></article-categories><title-group><article-title>Using CNN and LSTM neural networks for Arkhangelsk dialect word identification and classification</article-title><trans-title-group xml:lang="en"><trans-title>Using CNN and LSTM neural networks for Arkhangelsk dialect word identification and classification</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Shurykina</surname><given-names>Lyudmila S.</given-names></name><name xml:lang="en"><surname>Shurykina</surname><given-names>Lyudmila S.</given-names></name></name-alternatives><email>l.shurykina@narfu.ru</email><xref ref-type="aff" rid="aff1" /></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Latukhina</surname><given-names>Ekaterina A.</given-names></name><name xml:lang="en"><surname>Latukhina</surname><given-names>Ekaterina A.</given-names></name></name-alternatives><email>e.latukhina@narfu.ru</email><xref ref-type="aff" rid="aff2" /></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Petrova</surname><given-names>Tatyana V.</given-names></name><name 
xml:lang="en"><surname>Petrova</surname><given-names>Tatyana V.</given-names></name></name-alternatives><email>t.petrova@narfu.ru</email><xref ref-type="aff" rid="aff2" /></contrib></contrib-group><aff id="aff2"><institution>Northern (Arctic) Federal University named after M.V. Lomonosov, Arkhangelsk, Russia</institution></aff><aff id="aff1"><institution>Northern (Arctic) Federal University named after M.V. Lomonosov, Arkhangelsk, Russia</institution></aff><pub-date pub-type="epub"><year>2024</year></pub-date><volume>10</volume><issue>4</issue><fpage>0</fpage><lpage>0</lpage><self-uri content-type="pdf" xlink:href="/media/linguistics/2024/4/Research_Result_4-42-120-139.pdf" /><abstract xml:lang="ru"><p>The study of dialects offers insight into the culture and history of a people as reflected in their language. Dialectal vocabulary differs from standard vocabulary in meaning, pronunciation, word formation, and grammatical structure, primarily morphology. These patterns are also observed in the Arkhangelsk dialects. The aim of this paper is to develop a dialect word classifier that identifies dialect words in a given text and assigns each to one of several predefined groups. The novelty of this research is that no automated system for classifying dialect words based on Arkhangelsk dialect materials previously existed. The article describes the development of a neural network for dialect word identification and classification. Dialect words were identified from dialect texts gathered during dialectological research conducted from the 1960s to the present day. LSTM (long short-term memory) and CNN (convolutional neural network) architectures are compared. One of the main challenges in dialect word classification is that the neural network is trained on a limited amount of data. 
To overcome this limitation, we investigate a bigram-based approach in addition to unigram-based word encoding. The trained model that demonstrated the best results was integrated into our application for dialect word processing and analysis. A confusion matrix constructed for the best model shows the highest performance for the derivational class and the lowest for the lexical class.
</p></abstract><trans-abstract xml:lang="en"><p>The study of dialects offers insight into the culture and history of a people as reflected in their language. Dialectal vocabulary differs from standard vocabulary in meaning, pronunciation, word formation, and grammatical structure, primarily morphology. These patterns are also observed in the Arkhangelsk dialects. The aim of this paper is to develop a dialect word classifier that identifies dialect words in a given text and assigns each to one of several predefined groups. The novelty of this research is that no automated system for classifying dialect words based on Arkhangelsk dialect materials previously existed. The article describes the development of a neural network for dialect word identification and classification. Dialect words were identified from dialect texts gathered during dialectological research conducted from the 1960s to the present day. LSTM (long short-term memory) and CNN (convolutional neural network) architectures are compared. One of the main challenges in dialect word classification is that the neural network is trained on a limited amount of data. To overcome this limitation, we investigate a bigram-based approach in addition to unigram-based word encoding. The trained model that demonstrated the best results was integrated into our application for dialect word processing and analysis. A confusion matrix constructed for the best model shows the highest performance for the derivational class and the lowest for the lexical class.
</p></trans-abstract><kwd-group xml:lang="ru"><kwd>Dialect word classification</kwd><kwd>Natural language processing</kwd><kwd>Convolutional neural network</kwd><kwd>Long short-term memory</kwd></kwd-group><kwd-group xml:lang="en"><kwd>Dialect word classification</kwd><kwd>Natural language processing</kwd><kwd>Convolutional neural network</kwd><kwd>Long short-term memory</kwd></kwd-group></article-meta></front><back><ack><p>The research was supported by the Russian Science Foundation, project No. 23-28-01380, &amp;laquo;Thematic dictionary of Arkhangelsk dialects with electronic support&amp;raquo; (https://rscf.ru/project/23-28-01380/).</p></ack><ref-list><title>References</title><ref id="B1"><mixed-citation>Adel,&amp;nbsp;B., Eddine,&amp;nbsp;M.&amp;nbsp;C., Laouid,&amp;nbsp;A., Chait,&amp;nbsp;K. and Kara,&amp;nbsp;M. (2024). Using Transformers to Classify Arabic Dialects on Social Networks, 2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS), El Oued, Algeria, 1&amp;ndash;7. DOI: 10.1109/PAIS62114.2024.10541289 (In English)</mixed-citation></ref><ref id="B2"><mixed-citation>Alali,&amp;nbsp;M., Sharef,&amp;nbsp;N., Murad,&amp;nbsp;M. and Husin, N.&amp;nbsp;A. (2019). Narrow Convolutional Neural Network for Arabic Dialects Polarity Classification, IEEE Access, 7. DOI: 10.1109/ACCESS.2019.2929208 (In English)</mixed-citation></ref><ref id="B3"><mixed-citation>Arkhangelskiy, T.&amp;nbsp;A. (2021). Application of the dialectometric method to the classification of Udmurt dialects, Uralo-altaiskie issledovaniya, 2(41), 7&amp;ndash;20. DOI: 10.37892/2500-2902-2021-41-2-7-20 (In Russian)</mixed-citation></ref><ref id="B4"><mixed-citation>Azim, M.&amp;nbsp;A., Hussein,&amp;nbsp;W. and Badr,&amp;nbsp;N. (2022). Automatic Dialect identification of Spoken Arabic Speech using Deep Neural Networks, International Journal of Intelligent Computing and Information Sciences, 22, 4, 25&amp;ndash;34. 
DOI: 10.21608/ijicis.2022.152368.1207 (In English)</mixed-citation></ref><ref id="B5"><mixed-citation>Buckley,&amp;nbsp;K. (2021). Uncovering linguistic lineage through using a character N-gram based dialect classifier, The languages of Scotland and Ulster in a global context, past and present. Selected papers from the 13th triennial Forum for Research on the Languages of Scotland and Ulster, Munich, Germany, 139&amp;ndash;76. (In English)</mixed-citation></ref><ref id="B6"><mixed-citation>Devlin,&amp;nbsp;J., Chang,&amp;nbsp;M.-W., Lee,&amp;nbsp;K. and Toutanova,&amp;nbsp;K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv preprint. DOI: 10.48550/arXiv.1810.04805 (In English)</mixed-citation></ref><ref id="B7"><mixed-citation>Han,&amp;nbsp;M., Zhu,&amp;nbsp;D., Wen,&amp;nbsp;X., Shu,&amp;nbsp;L. and Yao,&amp;nbsp;Z. (2024). Research on Dialect Protection: Interaction Design of Chinese Dialects Based on BLSTM-CRF and FBM Theories, IEEE Access, 12, 22059&amp;ndash;22071. DOI: 10.1109/ACCESS.2024.3364098 (In English)</mixed-citation></ref><ref id="B8"><mixed-citation>H&amp;oslash;yland,&amp;nbsp;B. and Nesse,&amp;nbsp;A. (2023). Norwegian Dialect Classifications, Dialectologia, 10, 255&amp;ndash;298. DOI: 10.1344/Dialectologia2022.2022.10 (In English)</mixed-citation></ref><ref id="B9"><mixed-citation>Huang, T. J., Yang, J. Q., Shen, C., Liu, K. Q., Zhan, D. C. and Ye, H. J. (2024). Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens?, arXiv preprint. DOI: 10.48550/arXiv.2406.08477 (In English)</mixed-citation></ref><ref id="B10"><mixed-citation>Karbysheva,&amp;nbsp;D.&amp;nbsp;Y. and Radchenko&amp;nbsp;G.&amp;nbsp;I. (2020). Types of dialectisms and methods of their translation into a foreign language (based on the novel M.A. Sholokhov&amp;rsquo;s &amp;lsquo;Quiet Don&amp;rsquo;), Eurasian Scientific Union, 8-5(66), 294&amp;ndash;297. 
(In Russian)</mixed-citation></ref><ref id="B11"><mixed-citation>Kethireddy,&amp;nbsp;R., Kadiri,&amp;nbsp;S. and Gangashetty,&amp;nbsp;S. (2022). Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations, The Journal of the Acoustical Society of America, 151, 1077&amp;ndash;1092. DOI: 10.1121/10.0009405 (In English)</mixed-citation></ref><ref id="B12"><mixed-citation>Kolkova,&amp;nbsp;D.&amp;nbsp;E. (2023). Self-identification of personality through the use of dialect (based on the example of the Scottish dialect), Creative Linguistics: Collection of Scientific Articles, 6, 106&amp;ndash;111. (In Russian)</mixed-citation></ref><ref id="B13"><mixed-citation>Kornaukhova,&amp;nbsp;T.&amp;nbsp;V. and Goloshtanova,&amp;nbsp;A.&amp;nbsp;A. (2022). Reflection of modern realities in the dialects of the English language (based on the example of the Cockney dialect), Proceedings of the All-Russian scientific and practical conference &amp;ldquo;Avdeev Readings&amp;rdquo;, Penza, Russia, 90&amp;ndash;94. (In Russian)</mixed-citation></ref><ref id="B14"><mixed-citation>Kositsina,&amp;nbsp;Y. (2024). Dialectisms in the modern regional dialect of the village of Usmanka, Chebulinsky District, Kemerovo Region, Philology. Theory Practice, 17, 1577&amp;ndash;1583. DOI: 10.30853/phil20240228 (In Russian)</mixed-citation></ref><ref id="B15"><mixed-citation>Kuparinen,&amp;nbsp;O. (2024). Murre24: Dialect Identification of Finnish Internet Forum Messages, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 12003&amp;ndash;12015. (In English)</mixed-citation></ref><ref id="B16"><mixed-citation>Laith,&amp;nbsp;B. and Kang,&amp;nbsp;S. (2023). Transformer Text Classification Model for Arabic Dialects That Utilizes Inductive Transfer, Mathematics, 11, 4960. 
DOI: 10.3390/math11244960 (In English)</mixed-citation></ref><ref id="B17"><mixed-citation>Mutalov,&amp;nbsp;R.&amp;nbsp;O. (2020). On the problem of distinguishing Dargin languages and dialects, The Newman in Foreign Policy, 6, 57&amp;nbsp;(101), 6&amp;ndash;8. (In Russian)</mixed-citation></ref><ref id="B18"><mixed-citation>Nenasheva,&amp;nbsp;L.&amp;nbsp;V. (2021). Each garment has its own clasp, Cuadernos De Rus&amp;iacute;stica Espa&amp;ntilde;ola, 17, 211&amp;ndash;221. DOI: 10.30827/cre.v17.21023 (In Russian)</mixed-citation></ref><ref id="B19"><mixed-citation>Nenasheva,&amp;nbsp;L.&amp;nbsp;V. (2023). Tematicheskiy slovar arkhangelskih govorov [Thematic dictionary of Arkhangelsk dialects], Arkhangelsk: Limited Liability Company &amp;ldquo;Consulting Information Advertising Agency&amp;rdquo;, Arkhangelsk, Russia. (In Russian)</mixed-citation></ref><ref id="B20"><mixed-citation>Nenasheva,&amp;nbsp;L.&amp;nbsp;V. and Shurykina,&amp;nbsp;L.&amp;nbsp;S. (2024). Electronic dictionary of Arkhangelsk dialects, Arctic and North, 55, 243&amp;ndash;252. DOI: 10.37482/issn2221-2698.2024.55.243 (In Russian)</mixed-citation></ref><ref id="B21"><mixed-citation>Purtova,&amp;nbsp;G.&amp;nbsp;M. (2023). Meyankieli: dialect or language?, Proceedings of the International scientific and practical conference &amp;ldquo;The World Historical and Cultural Heritage of the Arctic&amp;rdquo;, Saint-Petersburg, Russia, 27&amp;ndash;28. (In Russian)</mixed-citation></ref><ref id="B22"><mixed-citation>Ramachandran,&amp;nbsp;P., Zoph,&amp;nbsp;B. and Le,&amp;nbsp;Q.&amp;nbsp;V. (2017). Searching for Activation Functions, arXiv:1710.05941. DOI: 10.48550/arXiv.1710.05941 (In English)</mixed-citation></ref><ref id="B25"><mixed-citation>Sainath,&amp;nbsp;T.&amp;nbsp;N., Vinyals,&amp;nbsp;O., Senior,&amp;nbsp;A. and Sak,&amp;nbsp;H. (2015). 
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks, IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, Australia, 4580&amp;ndash;4584. DOI: 10.1109/ICASSP.2015.7178838 (In English)</mixed-citation></ref><ref id="B26"><mixed-citation>Samsitova,&amp;nbsp;L.&amp;nbsp;H. (2020). Dialect as a reflection of the linguistic picture of the world (using the example of the Northwestern dialect of the Bashkir language), Mir nauki, kul&amp;rsquo;tury, obrazovanija, 6&amp;nbsp;(85), 474&amp;ndash;476. DOI: 10.24412/1991-5500-2020-685-474-476 (In Russian)</mixed-citation></ref><ref id="B27"><mixed-citation>Sciarretta,&amp;nbsp;A. (2024). Dialectometry-based classification of the Central&amp;ndash;Southern Italian dialects, Journal of Linguistic Geography, 12&amp;nbsp;(1), 13&amp;ndash;23. DOI: 10.1017/jlg.2024.7 (In English)</mixed-citation></ref><ref id="B28"><mixed-citation>Shamshin,&amp;nbsp;A.&amp;nbsp;L. (2024). The role of knowledge of Italian dialects in intercultural communication: their importance for successful adaptation in Italy, Proceedings of VIII International scientific and methodological conference &amp;ldquo;Problems of teaching philological disciplines to foreign students&amp;rdquo;, Voronezh, Russia, 221&amp;ndash;225.</mixed-citation></ref><ref id="B29"><mixed-citation>Shurykina,&amp;nbsp;L.&amp;nbsp;S. and Latukhina,&amp;nbsp;E.&amp;nbsp;A. (2023). Certificate of State Registration of the Computer Program No. 2023668038, 22 Aug 2023. (In Russian)</mixed-citation></ref><ref id="B30"><mixed-citation>Shurykina,&amp;nbsp;L.&amp;nbsp;S. and Latukhina,&amp;nbsp;E.&amp;nbsp;A. (2024). Automated creation of dialect dictionaries organization, Current Problems of Applied Mathematics, Informatics and Mechanics, Voronezh, Russia, 1017&amp;ndash;1022. (In Russian)</mixed-citation></ref><ref id="B31"><mixed-citation>Smetanina,&amp;nbsp;Z.&amp;nbsp;V. and Ivanova,&amp;nbsp;G.&amp;nbsp;A. (2020). 
The variation of the word in the &amp;ldquo;Regional dictionary of Vyatka dialects&amp;rdquo;, Vestnik Tomskogo gosudarstvennogo universiteta, 451, 56&amp;ndash;68. DOI: 10.17223/15617793/451/8 (In Russian)</mixed-citation></ref><ref id="B32"><mixed-citation>Themistocleous,&amp;nbsp;C. (2017). Dialect classification using vowel acoustic parameters. Speech Communication, 92, 13&amp;ndash;22. (In English)</mixed-citation></ref><ref id="B33"><mixed-citation>Themistocleous,&amp;nbsp;C. (2019). Dialect Classification From a Single Sonorant Sound Using Deep Neural Networks, Frontiers in Communication, 4. DOI: 10.3389/fcomm.2019.00064 (In English)</mixed-citation></ref><ref id="B34"><mixed-citation>Vernyaeva,&amp;nbsp;R.&amp;nbsp;A. and Zhdanova,&amp;nbsp;E.&amp;nbsp;A. (2023). Multimedia Corpus of Russian Dialects of Udmurtia: Electronic Subcorpus of Spoken Speech. Cuadernos De Rus&amp;iacute;stica Espa&amp;ntilde;ola, 19, 67&amp;ndash;79. DOI: 10.30827/cre.v19.28131 (In Russian)</mixed-citation></ref><ref id="B35"><mixed-citation>Yamani,&amp;nbsp;A., Alziyady,&amp;nbsp;R., AlYami,&amp;nbsp;R., Albelali,&amp;nbsp;S., Albelali,&amp;nbsp;L., Almulhim,&amp;nbsp;J., Alsulami, A., Alfarraj, M., and Al-Zaidy,&amp;nbsp;R. (2024). The kind dataset: A social collaboration approach for nuanced dialect data collection, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 32&amp;ndash;43. (In English)</mixed-citation></ref><ref id="B36"><mixed-citation>Ye,&amp;nbsp;S., Zhao,&amp;nbsp;R. and Fang,&amp;nbsp;X. (2019). An Ensemble Learning Method for Dialect Classification, IOP Conference Series: Materials Science and Engineering, 569 052064. DOI: 10.1088/1757-899X/569/5/052064 (In English)</mixed-citation></ref><ref id="B37"><mixed-citation>Zhang,&amp;nbsp;Y. and Ren,&amp;nbsp;W. (2022). 
From hǎo to hǒu &amp;ndash; stylising online communication with Chinese dialects, International Journal of Multilingualism, 21&amp;nbsp;(1), 149&amp;ndash;168. DOI: 10.1080/14790718.2022.2061981 (In English)</mixed-citation></ref></ref-list></back></article>