<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article article-type="research-article" dtd-version="1.2" xml:lang="ru" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><front><journal-meta><journal-id journal-id-type="issn">2313-8912</journal-id><journal-title-group><journal-title>Научный результат. Вопросы теоретической и прикладной лингвистики</journal-title></journal-title-group><issn pub-type="epub">2313-8912</issn></journal-meta><article-meta><article-id pub-id-type="doi">10.18413/2313-8912-2024-10-2-0-4</article-id><article-id pub-id-type="publisher-id">3499</article-id><article-categories><subj-group subj-group-type="heading"><subject>ПРИКЛАДНАЯ ЛИНГВИСТИКА</subject></subj-group></article-categories><title-group><article-title>Написанный vs сгенерированный текст: «естественность» как категория текстовая и психолингвистическая</article-title><trans-title-group xml:lang="en"><trans-title>Written vs generated text: “naturalness” as a textual and psycholinguistic category</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Колмогорова</surname><given-names>Анастасия Владимировна</given-names></name><name xml:lang="en"><surname>Kolmogorova</surname><given-names>Anastasia Vladimirovna</given-names></name></name-alternatives><email>akolmogorova@hse.ru</email><xref ref-type="aff" rid="aff1" /></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Марголина</surname><given-names>Анастасия Валерьевна</given-names></name><name xml:lang="en"><surname>Margolina</surname><given-names>Anastasia Valerievna</given-names></name></name-alternatives><email>avmargolina@edu.hse.ru</email></contrib></contrib-group><aff id="aff1"><institution>Национальный исследовательский университет «Высшая школа экономики»</institution></aff><pub-date 
pub-type="epub"><year>2024</year></pub-date><volume>10</volume><issue>2</issue><fpage>0</fpage><lpage>0</lpage><self-uri content-type="pdf" xlink:href="/media/linguistics/2024/2/2024-02_июнь_Том_10_2-71-99.pdf" /><abstract xml:lang="ru"><p>В контексте развития технологий текстовой генерации оппозиция «естественность – неестественность текста» трансформируется в новую дихотомию: «естественность – искусственность». Цель данной статьи – исследовать феномен естественности в данном контексте с двух точек зрения: анализа лингвистических характеристик естественного текста на фоне сгенерированного (искусственного) и интроспективных представлений информантов-носителей русского языка относительно того, каким должен быть «естественный» текст и чем он должен отличаться от сгенерированного. Материалом для исследования послужил параллельный корпус кинорецензий на русском языке, состоящий из двух подкорпусов: рецензий, написанных людьми, и сгенерированных большой языковой моделью на основе промптов, представляющих собой начала отзывов из первого подкорпуса. Для сопоставительного анализа двух подкорпусов применялись следующие методы: метод компьютерной обработки текстов для подсчета значений 130 метрик лингвистической сложности текста; метод психолингвистического эксперимента; метод экспертного анализа текста; метод сравнительно-сопоставительного анализа. В результате было определено, что с точки зрения собственных лингвистических характеристик «естественные» тексты отличаются от сгенерированных преимущественно большей гибкостью синтаксической структуры, допускающей как пропуск или сокращение структур, так и избыточность, а также большей лексической вариативностью. Естественность же как категория психолингвистическая связана с автостереотипными представлениями информантов о том, какими когнитивными характеристиками обладают люди как вид. 
Анализ ошибочно атрибутированных информантами текстов (сгенерированных, но размеченных как естественные, и наоборот) показал, что ряд характеристик данного автостереотипа переоценивается информантами, другие же в целом коррелируют с лингвистической спецификой текстов из подкорпуса написанных рецензий. В заключение сформулированы определения естественности как текстовой и психолингвистической категории.</p></abstract><trans-abstract xml:lang="en"><p>In the context of the development of text generation technologies, the opposition “naturalness – unnaturalness of text” is being transformed into a new dichotomy: “naturalness – artificiality”. The aim of this article is to investigate the phenomenon of naturalness in this context from two perspectives: analyzing the linguistic characteristics of natural texts against generated (artificial) ones, and systematizing the introspective perceptions of Russian native speaker informants as to what a “natural” text should be like and how it should differ from a generated one. The material for the study was a parallel corpus of film reviews in Russian consisting of two subcorpora: reviews written by people and reviews generated by a large language model from prompts consisting of the openings of reviews from the first subcorpus. The following methods were applied in the comparative analysis of the two subcorpora: computer-assisted text processing to calculate the values of 130 metrics of linguistic text complexity, a psycholinguistic experiment, expert text analysis, and contrastive analysis. As a result, it was determined that, in terms of their intrinsic linguistic characteristics, “natural” texts differ from generated texts mainly in the greater flexibility of their syntactic structure, which allows both the omission or reduction of structures and redundancy, as well as in greater lexical variability. 
Naturalness as a psycholinguistic category is related to the informants’ autostereotypical ideas about the cognitive characteristics of humans as a species. The analysis of texts erroneously attributed by informants (generated texts labelled as natural, and vice versa) showed that a number of characteristics of this autostereotype are overestimated by informants, while others, on the whole, correlate with the linguistic specificity of texts from the subcorpus of written reviews. In conclusion, we formulate definitions of naturalness as a textual and as a psycholinguistic category.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>Контролируемая генерация</kwd><kwd>Естественность</kwd><kwd>Текстовая категория</kwd><kwd>Психолингвистическая категория</kwd><kwd>Метрики лингвистической сложности</kwd><kwd>Эксперимент</kwd><kwd>Русский язык</kwd></kwd-group><kwd-group xml:lang="en"><kwd>Controlled generation</kwd><kwd>Naturalness</kwd><kwd>Text category</kwd><kwd>Psycholinguistic category</kwd><kwd>Metrics of text complexity</kwd><kwd>Experiment</kwd><kwd>Russian language</kwd></kwd-group></article-meta></front><back><ack><p>Статья подготовлена по материалам проекта «Текст как Big Data: методы и модели работы с большими текстовыми данными», выполняемого в рамках Программы фундаментальных исследований НИУ ВШЭ в 2024 году.</p></ack><ref-list><title>Список литературы</title><ref id="B1"><mixed-citation>Alzahrani, E. and Jololian, L. (2021). How Different Text-Preprocessing Techniques Using The BERT Model Affect The Gender Profiling of Authors, arXiv preprint arXiv:2109.13890. https://doi.org/10.48550/arXiv.2109.13890 (In English)</mixed-citation></ref><ref id="B2"><mixed-citation>Bally, Ch. (1913). Le langage et la vie, Edition Atar, Paris, France. (In French)</mixed-citation></ref><ref id="B3"><mixed-citation>Bender, E. M., Gebru, T., McMillan-Major, A. 
and Shmitchell, Sh. (2021). On the dangers of stochastic parrots: Can language models be too big?, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. (In English)</mixed-citation></ref><ref id="B4"><mixed-citation>Blinova, O. and Tarasov, N. (2022). A hybrid model of complexity estimation: Evidence from Russian legal texts, Frontiers in Artificial Intelligence, 5. https://doi.org/10.3389/frai.2022.1008530 (In English)</mixed-citation></ref><ref id="B5"><mixed-citation>Celikyilmaz, A., Clark, E. and Gao, J. (2021). Evaluation of text generation: A survey, arXiv preprint arXiv:2006.14799. https://doi.org/10.48550/arXiv.2006.14799 (In English)</mixed-citation></ref><ref id="B6"><mixed-citation>Dashela, T. and Mustika, Y. (2021). An Analysis of Cohesion and Coherence in Written Text of Line Today about Wedding Kahiyang Ayu and Bobby Nasution, SALEE: Study of Applied Linguistics and English Education, 2 (2), 192–203. https://doi.org/10.35961/salee.v2i02.282 (In English)</mixed-citation></ref><ref id="B7"><mixed-citation>Fauconnier, G. (1981). Pragmatic functions and mental spaces, Cognition, 10 (1–3), 85–88. (In English)</mixed-citation></ref><ref id="B8"><mixed-citation>Holtzman, A., Buys, J., Du, L., Forbes, M. and Choi, Y. (2019). The Curious Case of Neural Text Degeneration, arXiv preprint arXiv:1904.09751. https://doi.org/10.48550/arXiv.1904.09751 (In English)</mixed-citation></ref><ref id="B9"><mixed-citation>Lavie, A. and Agarwal, A. (2007). METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments, Proceedings of the Second Workshop on Statistical Machine Translation, 228–231. (In English)</mixed-citation></ref><ref id="B10"><mixed-citation>Li, C., Zhang, M. and He, Y. (2022). 
The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models, arXiv preprint arXiv:2108.06084v4. https://doi.org/10.48550/arXiv.2108.06084 (In English)</mixed-citation></ref><ref id="B11"><mixed-citation>Lin, Ch-Y. (2004). ROUGE: A package for automatic evaluation of summaries, Text Summarization Branches Out, 74–81. (In English)</mixed-citation></ref><ref id="B12"><mixed-citation>Liu, X., Ji, K., Fu, Y., Lam Tam, W., Du, Zh., Yang, Zh. and Tang, J. (2022). P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-Tuning Universally Across Scales and Tasks, arXiv preprint arXiv:2110.07602. https://doi.org/10.48550/arXiv.2110.07602 (In English)</mixed-citation></ref><ref id="B13"><mixed-citation>Margolina, A. V. (2022). Controlling impression: making ruGPT3 generate sentiment-driven movie reviews, Journal of Applied Linguistics and Lexicography, 4 (1), 15–25. (In English)</mixed-citation></ref><ref id="B14"><mixed-citation>Margolina, A. and Kolmogorova, A. (2023). Exploring evaluation techniques in controlled text generation: a comparative study of semantics and sentiment in ruGPTLarge-generated and human-written movie reviews, Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference, 1082–1090. (In English)</mixed-citation></ref><ref id="B15"><mixed-citation>Mikhaylovskiy, N. (2023). Long story generation challenge, Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, 10–16. (In English)</mixed-citation></ref><ref id="B16"><mixed-citation>Mnih, V., Kavukcuoglu, K., Silver, D. et al. (2015). Human-level control through deep reinforcement learning, Nature, 518 (7540), 529–533. 
http://dx.doi.org/10.1038/nature14236 (In English)</mixed-citation></ref><ref id="B17"><mixed-citation>Newmark, P. (1987). Manual de traducción. Madrid: Ediciones Cátedra. (In Spanish)</mixed-citation></ref><ref id="B18"><mixed-citation>Novikova, J., Lemon, O. and Rieser, V. (2016). Crowd-sourcing NLG data: Pictures elicit better data, Proceedings of the 9th International Natural Language Generation Conference, 265–273. DOI: 10.18653/v1/W16-6644 (In English)</mixed-citation></ref><ref id="B19"><mixed-citation>Obeidat, A. M., Ayyad, G. R., Sepora, T. and Mahadi, T. (2020). The tension between naturalness and accuracy in translating lexical collocations in literary text, Journal of Social Sciences and Humanities, 17 (8), 123–134. (In English)</mixed-citation></ref><ref id="B20"><mixed-citation>Orešnik, J. (2002). Naturalness in English: some (morpho)syntactic examples, Linguistica, 42 (1), 143–160. DOI: 10.4312/linguistica.42.1.143-160 (In English)</mixed-citation></ref><ref id="B21"><mixed-citation>Rachmawati, S., Sukyadi, D. and Samsudin, D. (2021). Lexical cohesion in the commercial advertisements of five Korean magazines, Journal of Korean Applied Linguistics, 1 (1), 29–44. (In English)</mixed-citation></ref><ref id="B22"><mixed-citation>Rogers, M. (1998). Naturalness and Translation, SYNAPS: Journal of Professional Communication, 2 (99), 9–3. (In English)</mixed-citation></ref><ref id="B23"><mixed-citation>Schramm, A. (1998). Tense and Aspect in Discourse, Studies in Second Language Acquisition, 20 (3), 433–434. https://doi.org/10.1017/s0272263198283069 (In English)</mixed-citation></ref><ref id="B24"><mixed-citation>Schuff, H., Vanderlyn, L., Adel, H. and Vu, Th. (2023). 
How to do human evaluation: A brief introduction to user studies in NLP, Natural Language Engineering, 29, 1–24. DOI: 10.1017/S1351324922000535 (In English)</mixed-citation></ref><ref id="B25"><mixed-citation>Serce, G. (2014). Relationship between naturalness and translation methods: Towards an objective characterization, Synergies Chili, 10, 139–153. (In English)</mixed-citation></ref><ref id="B26"><mixed-citation>Siipi, H. (2008). Dimensions of Naturalness, Ethics and the Environment, 13 (1), 71–103. https://doi.org/10.2979/ETE.2008.13.1.71 (In English)</mixed-citation></ref><ref id="B27"><mixed-citation>Sinclair, J. (1983). Naturalness in language, in Aarts, J. and Meys, W. (eds.), Corpus Linguistics, 203–210. (In English)</mixed-citation></ref><ref id="B28"><mixed-citation>Talmy, L. (2000). Toward a cognitive semantics, vol. 2: Typology and process in concept structuring. Cambridge, Mass.: MIT Press. (In English)</mixed-citation></ref><ref id="B29"><mixed-citation>Thibault, P. J. (2011). First order languaging dynamics and second order language: The distributed language view, Ecological Psychology, 23 (3), 210–245. (In English)</mixed-citation></ref><ref id="B30"><mixed-citation>Venuti, L. (1995). The translator’s invisibility, Routledge, London and New York. (In English)</mixed-citation></ref><ref id="B31"><mixed-citation>Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. and Zhou, D. (2023). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, arXiv preprint arXiv:2201.11903. https://doi.org/10.48550/arXiv.2201.11903 (In English)</mixed-citation></ref><ref id="B32"><mixed-citation>Wilson, D. (1998). Discourse, coherence and relevance: A reply to Rachel Giora, Journal of Pragmatics, 29 (1), 57–74. 
(In English)</mixed-citation></ref><ref id="B33"><mixed-citation>Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. and Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT, arXiv preprint arXiv:1904.09675. https://doi.org/10.48550/arXiv.1904.09675 (In English)</mixed-citation></ref><ref id="B34"><mixed-citation>Zhou, J. and Bhat, S. (2021). Paraphrase generation: A survey of the state of the art, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 5075–5086. (In English)</mixed-citation></ref><ref id="B35"><mixed-citation>Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J. and Yu, Y. (2018). Texygen: A Benchmarking Platform for Text Generation Models, arXiv preprint arXiv:1802.01886. https://doi.org/10.48550/arXiv.1802.01886 (In English)</mixed-citation></ref></ref-list></back></article>