<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article article-type="research-article" dtd-version="1.2" xml:lang="ru" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><front><journal-meta><journal-id journal-id-type="issn">2313-8912</journal-id><journal-title-group><journal-title>Research Result. Theoretical and Applied Linguistics</journal-title></journal-title-group><issn pub-type="epub">2313-8912</issn></journal-meta><article-meta><article-id pub-id-type="doi">10.18413/2313-8912-2023-9-1-0-4</article-id><article-id pub-id-type="publisher-id">3061</article-id><article-categories><subj-group subj-group-type="heading"><subject>TEXT COMPLEXITY PREDICTORS: METHODS AND APPROACHES FOR ASSESSMENT</subject></subj-group></article-categories><title-group><article-title>Classification of Russian textbooks by grade level and topic using ReaderBench</article-title><trans-title-group xml:lang="en"><trans-title>Classification of Russian textbooks by grade level and topic using ReaderBench</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Paraschiv</surname><given-names>Andrei</given-names></name><name xml:lang="en"><surname>Paraschiv</surname><given-names>Andrei</given-names></name></name-alternatives><email>andrei.paraschiv74@upb.ro</email><xref ref-type="aff" rid="aff1" /></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Dascalu</surname><given-names>Mihai</given-names></name><name xml:lang="en"><surname>Dascalu</surname><given-names>Mihai</given-names></name></name-alternatives><email>mihai.dascalu@upb.ro</email><xref ref-type="aff" rid="aff1" /></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Solnyshkina</surname><given-names>Marina I.</given-names></name><name xml:lang="en"><surname>Solnyshkina</surname><given-names>Marina 
I.</given-names></name></name-alternatives><email>mesoln@yandex.ru</email><xref ref-type="aff" rid="aff2" /></contrib></contrib-group><aff id="aff2"><institution>Kazan Federal University, Russia</institution></aff><aff id="aff1"><institution>Polytechnic University of Bucharest, Romania</institution></aff><pub-date pub-type="epub"><year>2023</year></pub-date><volume>9</volume><issue>1</issue><fpage>0</fpage><lpage>0</lpage><self-uri content-type="pdf" xlink:href="/media/linguistics/2023/1/Лингвистика_9_1_2023-50-63.pdf" /><abstract xml:lang="ru"><p>Textbooks are essential resources for classroom and offline reading, and the quality of learning materials shapes the entire learning process. Among the most important factors to consider are their readability and comprehensibility. Therefore, the correct pairing of textbook complexity and student grade level is paramount. This article analyzes automated classification methods for Russian-language textbooks along two dimensions: the topic of the text and its complexity, as reflected by the corresponding school grade level. The corpus comprises 154 textbooks from the Russian Federation, covering the second through eleventh grades. Our analysis applies machine learning techniques to the textual complexity indices provided by the open-source multilingual framework ReaderBench, alongside BERT-based models, for both classification tasks. Additionally, we explore using the most predictive ReaderBench features in conjunction with contextualized embeddings from BERT. Our results indicate that incorporating textual complexity indices improves the classification performance of BERT-based models on our dataset, which was split using two strategies. 
More specifically, the F1 score for topic classification improved to 92.63%, while the F1 score for school grade-level classification improved to 54.06% for the Greedy approach in which multiple adjacent paragraphs are considered a single text unit up until reaching the maximum length of 512 tokens imposed by the language model.</p></abstract><trans-abstract xml:lang="en"><p>Textbooks are essential resources for classroom and offline reading, and the quality of learning materials shapes the entire learning process. Among the most important factors to consider are their readability and comprehensibility. Therefore, the correct pairing of textbook complexity and student grade level is paramount. This article analyzes automated classification methods for Russian-language textbooks along two dimensions: the topic of the text and its complexity, as reflected by the corresponding school grade level. The corpus comprises 154 textbooks from the Russian Federation, covering the second through eleventh grades. Our analysis applies machine learning techniques to the textual complexity indices provided by the open-source multilingual framework ReaderBench, alongside BERT-based models, for both classification tasks. Additionally, we explore using the most predictive ReaderBench features in conjunction with contextualized embeddings from BERT. Our results indicate that incorporating textual complexity indices improves the classification performance of BERT-based models on our dataset, which was split using two strategies. 
More specifically, the F1 score for topic classification improved to 92.63%, while the F1 score for school grade-level classification improved to 54.06% for the Greedy approach in which multiple adjacent paragraphs are considered a single text unit up until reaching the maximum length of 512 tokens imposed by the language model.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>Text readability</kwd><kwd>Russian language</kwd><kwd>Textbook analysis</kwd><kwd>Topic classification</kwd><kwd>ReaderBench framework</kwd><kwd>Textual complexity indices</kwd><kwd>Transformer-based Language Model</kwd></kwd-group><kwd-group xml:lang="en"><kwd>Text readability</kwd><kwd>Russian language</kwd><kwd>Textbook analysis</kwd><kwd>Topic classification</kwd><kwd>ReaderBench framework</kwd><kwd>Textual complexity indices</kwd><kwd>Transformer-based Language Model</kwd></kwd-group></article-meta></front><back><ack><p>This work was supported by a grant of the Ministry of Research, Innovation and Digitalization, Project CloudPrecis, Contract Number 344/390020/06.09.2021, MySMIS code: 124812, within POC. We thank the Research Lab Text Analytics at Kazan Federal University for assisting in compiling the corpus of academic texts and cooperating to conduct the research.</p></ack></back></article>