DOI: 10.18413/2313-8912-2024-10-4-0-6

Using CNN and LSTM neural networks for Arkhangelsk dialect word identification and classification

The study of dialects offers insight into the culture and history of a people as reflected in their language. Dialectal vocabulary differs from standard vocabulary in meaning, pronunciation, word formation, and grammatical structure, primarily morphology. The Arkhangelsk dialects show the same patterns. The aim of this paper is to develop a dialect word classifier that identifies dialect words in a given text and assigns each of them to one of several predefined groups. The novelty of the research is that no automated system for classifying dialect words based on Arkhangelsk dialect materials currently exists. The article describes the development of a neural network for dialect word identification and classification. The dialect words were drawn from dialect texts gathered during dialectological research conducted from the 1960s to the present day. LSTM (long short-term memory) and CNN (convolutional neural network) architectures are compared. One of the main challenges in dialect word classification is that the neural network must be trained on a limited amount of data. To mitigate this limitation, we investigate a bigram-based encoding of words in addition to the unigram-based encoding. The trained model that demonstrated the best results was integrated into our application for processing and analyzing dialect words. A confusion matrix constructed for the best model shows the highest performance for the derivational class and the lowest for the lexical class.
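The difference between unigram-based and bigram-based word encoding can be illustrated with a minimal Python sketch. It assumes that the n-grams are taken at the character level, which the abstract does not state explicitly. The function names `build_vocab` and `encode`, the padding and unknown-token ids, and the toy words are all hypothetical and are not taken from the authors' implementation or corpus.

```python
def build_vocab(words, n):
    """Map every character n-gram seen in `words` to an integer id.

    Ids 0 and 1 are reserved for padding and unknown n-grams
    (an assumption of this sketch, not a detail from the paper).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for word in words:
        for i in range(len(word) - n + 1):
            vocab.setdefault(word[i:i + n], len(vocab))
    return vocab


def encode(word, vocab, n, max_len):
    """Encode `word` as a fixed-length list of n-gram ids."""
    ids = [vocab.get(word[i:i + n], vocab["<unk>"])
           for i in range(len(word) - n + 1)]
    ids = ids[:max_len]                      # truncate long words
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones


# Toy training words, chosen only for illustration.
train_words = ["дом", "домик"]
unigram_vocab = build_vocab(train_words, n=1)
bigram_vocab = build_vocab(train_words, n=2)

unigram_ids = encode("домик", unigram_vocab, n=1, max_len=6)
bigram_ids = encode("домик", bigram_vocab, n=2, max_len=6)
```

Sequences of this kind could then be passed to an embedding layer in front of a CNN or an LSTM. A bigram id carries information about adjacent characters that a unigram id does not, which is one plausible reason to try such an encoding when training data is scarce.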
