Analysis of incorrect POS tagging in German student texts containing linguistic errors
PACT, an electronic learner corpus of student texts in German, includes part-of-speech (POS) tagging, which is performed automatically with RFTagger. Because the corpus texts are written by students, they may contain grammatical, spelling, stylistic, and other errors, and sentences may violate the rules and accepted norms of the language. Such errors can mislead programs that process texts automatically and produce incorrect tags that must then be verified manually. The purpose of this study is to investigate how strongly different kinds of errors in non-authentic texts affect the results of automatic POS tagging. Based on the expert error annotation of the corpus texts, 11 error types that affect tagger quality were identified. For each error type, ten sentences containing that error were selected from the corpus. The resulting pool of sentences was processed by two POS taggers, RFTagger and TreeTagger, and the parts of speech they assigned were compared with those determined manually by experts. The comparison revealed that the taggers make mistakes in the following cases: when the undeclined form of an adjective is written instead of the declined form; when a single word is written as separate words; when the suffix "-er" is missing from possessive adjectives formed from geographical names; when a noun is written with a lowercase initial letter; and when a verb is written with a capital initial letter. For each case, the article analyzes the forms and causes of incorrect POS tagging, as well as the differences in behavior between the two taggers. Taking these patterns into account will make verification of POS tagging in the German learner corpus more efficient.
The results of the study will also be useful for developers of part-of-speech taggers.
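
The comparison of tagger output with expert annotation can be sketched as follows. This is a minimal illustration and not the tooling used in the study: the function name mismatch_rates, the record layout, the error-type labels, and the tag strings are all assumptions made for the example, and the records are invented rather than taken from the corpus.

```python
from collections import defaultdict


def mismatch_rates(records):
    """Return, per error type, the share of tokens whose tagger tag
    differs from the expert tag.

    `records` is an iterable of (error_type, expert_tag, tagger_tag)
    tuples; this layout is a hypothetical choice for the sketch.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for error_type, expert_tag, tagger_tag in records:
        totals[error_type] += 1
        if tagger_tag != expert_tag:
            wrong[error_type] += 1
    return {etype: wrong[etype] / totals[etype] for etype in totals}


# Invented toy records (not corpus data): error type, expert tag, tagger tag.
toy_records = [
    ("lowercase_noun", "NN", "ADJD"),
    ("lowercase_noun", "NN", "NN"),
    ("capitalized_verb", "VVFIN", "NN"),
    ("capitalized_verb", "VVFIN", "NN"),
]

rates = mismatch_rates(toy_records)
for error_type, rate in rates.items():
    print(f"{error_type}: {rate:.0%} of tokens tagged differently from the expert")
```

Running the same function once per tagger on the same sentences would allow the two taggers to be compared error type by error type.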