DOI: 10.18413/2313-8912-2022-8-3-0-6

Analysis of incorrect POS-tagging in student texts with linguistic errors in German

Irina A. Kotiurova (Petrozavodsk State University, Russia)
Liudmila V. Shchegoleva (Petrozavodsk State University, Russia)

The electronic learner corpus of German student texts, PACT, includes part-of-speech (POS) tagging. This markup is produced automatically with RFTagger. Because the corpus texts are written by students, they may contain grammatical, spelling, stylistic, and other errors, and sentences may violate the rules and accepted norms of the language. Such errors can affect programs that process texts automatically and lead to incorrect tagging that has to be verified manually. The purpose of the study is to determine how strongly different kinds of errors in non-authentic texts influence the results of automatic part-of-speech tagging. Based on the expert error markup in the corpus texts, 11 types of errors that affect tagging quality were identified. For each type of error, ten sentences containing that error were selected from the corpus. The resulting pool of texts was processed by two part-of-speech taggers, RFTagger and TreeTagger. The parts of speech suggested by the taggers were compared with the parts of speech assigned manually by experts. The comparison revealed the following patterns: the taggers make mistakes when the uninflected form of an adjective is written instead of the inflected form; when a single word is written as separate words; when the suffix "-er" is missing in possessive adjectives formed from geographical names; when a noun is written with a lowercase initial letter; and when a verb is written with a capital initial letter. For each case, the article analyzes the forms and causes of incorrect POS-tagging, as well as the differences in the behavior of the two taggers. Taking these patterns into account will make verification of POS-tagging in the German learner corpus more efficient. The results of the study will also be useful for developers of part-of-speech taggers.
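The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data layout, the function name `mismatch_counts`, the error-type labels, the simplified tag labels, and the sample tagger outputs are all assumptions made up for illustration and are not real output of RFTagger or TreeTagger.

```python
from collections import defaultdict
from typing import NamedTuple


class TaggedToken(NamedTuple):
    token: str
    expert: str      # part of speech assigned manually by an expert
    rftagger: str    # part of speech suggested by RFTagger (hypothetical value)
    treetagger: str  # part of speech suggested by TreeTagger (hypothetical value)


def mismatch_counts(sentences):
    """Count, per error type, how many tokens each tagger labels
    differently from the expert annotation."""
    counts = defaultdict(lambda: {"rftagger": 0, "treetagger": 0, "tokens": 0})
    for error_type, tokens in sentences:
        row = counts[error_type]
        for t in tokens:
            row["tokens"] += 1
            row["rftagger"] += int(t.rftagger != t.expert)
            row["treetagger"] += int(t.treetagger != t.expert)
    return dict(counts)


# Invented sample: simplified labels (PRON, VERB, ...) stand in for the
# taggers' actual tagsets, and the tagger columns are made-up values.
sentences = [
    ("noun_lowercase", [
        TaggedToken("Ich", "PRON", "PRON", "PRON"),
        TaggedToken("lese", "VERB", "VERB", "VERB"),
        TaggedToken("ein", "DET", "DET", "DET"),
        TaggedToken("buch", "NOUN", "VERB", "NOUN"),
    ]),
    ("verb_capitalized", [
        TaggedToken("Wir", "PRON", "PRON", "PRON"),
        TaggedToken("Gehen", "VERB", "NOUN", "NOUN"),
        TaggedToken("heute", "ADV", "ADV", "ADV"),
    ]),
]

result = mismatch_counts(sentences)
for error_type, row in result.items():
    print(error_type, row)
```

Grouping the counts by error type mirrors the study design, in which sentences were selected per error type, and keeping one column per tagger makes differences between the two taggers visible for each type.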

Keywords: POS-tagging, Learner corpus, German, RFTagger, TreeTagger.


Information for citation:

Kotiurova, I. A. and Shchegoleva, L. V. (2022). Analysis of incorrect POS-tagging in student texts with linguistic errors in German, Research Result. Theoretical and Applied Linguistics, 8 (3), 87-99. DOI: 10.18413/2313-8912-2022-8-3-0-6




Research Result. Theoretical and Applied Linguistics (ISSN 2313-8912)

The journal materials and website are licensed under Creative Commons «Attribution» 4.0 International.
