Multi-word expressions for Russian L2 learners: corpora-based selection with expert verification
This article describes the creation of a corpus-based list of the most relevant multi-word expressions for learners of Russian as a second language (L2), distributed across the levels of the Common European Framework of Reference for Languages (CEFR) from A1 to C1. Modern linguistic and cognitive research shows that speech is patterned and consists largely of stable segments. This observation is in line with the linguodidactic idea of teaching combinations of language units of various kinds rather than isolated units. However, selecting and ranking multi-word expressions by proficiency level is hampered by the difficulty of extracting them automatically from a text corpus and estimating their frequency, as well as by disagreements about the boundaries, linguistic nature, and terminology of multi-word expressions. The list of the most valuable fixed multi-word expressions was compiled from several sources: two existing CEFR-graded vocabulary lists for Russian L2 learners, namely the lexical minimums for the TORFL (Test of Russian as a Foreign Language) system and the Russian KELLY list (KEywords for Language Learning for Young and adults alike); the most frequent n-grams from RuFoLa, a corpus of Russian L2 textbooks, and from the Russian Web corpus of internet texts; and the list of discourse formulas from the «Pragmaticon» project. The CEFR level of each multi-word expression is predicted with the frequency-based Max Delta measure, and the effectiveness of the measure is then validated through annotation by multiple experts. The resulting list contains 1645 entries from levels A1 to C1. It has been integrated into an automated text analysis system for learners of Russian as a foreign language and may be useful to a wide range of professionals who prepare educational content for foreign language learners.
The proposed Max Delta measure showed a high degree of agreement with expert evaluations at proficiency levels A1 to B1. This suggests that its potential is worth exploring further, both for related practical tasks and for selecting language learning content for other languages.
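The abstract names the Max Delta measure but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: an expression is assigned to the level at which its frequency rises most sharply compared with the previous level. The function name, the frequency values, and the treatment of the frequency before A1 as zero are all hypothetical and may differ from the authors' actual procedure.

```python
from typing import Dict, List

# CEFR levels covered by the list, in ascending order.
LEVELS: List[str] = ["A1", "A2", "B1", "B2", "C1"]


def max_delta_level(freq_by_level: Dict[str, float]) -> str:
    """Return the level with the largest frequency increase over the previous level.

    Assumption (not taken from the article): the frequency before A1 is zero,
    and ties are resolved in favour of the lower level.
    """
    best_level = LEVELS[0]
    best_delta = float("-inf")
    previous = 0.0
    for level in LEVELS:
        current = freq_by_level.get(level, 0.0)
        delta = current - previous
        if delta > best_delta:  # strict comparison keeps the lower level on ties
            best_level, best_delta = level, delta
        previous = current
    return best_level


# Hypothetical relative frequencies of one expression in textbooks of each level.
example = {"A1": 5.0, "A2": 40.0, "B1": 45.0, "B2": 50.0, "C1": 48.0}
print(max_delta_level(example))  # A2
```

In this invented example the largest jump (from 5 to 40) occurs at A2, so the expression would be assigned to A2 even though its absolute frequency peaks later.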
Laposhina, A. N., Khramchanka, T. A. and Lebedeva, M. Yu. (2024). Multi-word expressions for Russian L2 learners: corpora-based selection with expert verification, Research Result. Theoretical and Applied Linguistics, 10 (2), 117-137. DOI: 10.18413/2313-8912-2024-10-2-0-6
The work of Laposhina A. N. and Lebedeva M. Yu. on this article was financially supported by the state assignment of the Ministry of Education and Science of the Russian Federation for 2020–2024 (No. FZNM-2020-0005). The study was conducted while Khramchanka T. A. was taking part in the «InteRussia» fellowship programme of the Gorchakov Fund.