<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article article-type="research-article" dtd-version="1.2" xml:lang="ru" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><front><journal-meta><journal-id journal-id-type="issn">2313-8912</journal-id><journal-title-group><journal-title>Научный результат. Вопросы теоретической и прикладной лингвистики</journal-title></journal-title-group><issn pub-type="epub">2313-8912</issn></journal-meta><article-meta><article-id pub-id-type="doi">10.18413/2313-8912-2024-10-3-0-7</article-id><article-id pub-id-type="publisher-id">3546</article-id><article-categories><subj-group subj-group-type="heading"><subject>ПРИКЛАДНАЯ ЛИНГВИСТИКА</subject></subj-group></article-categories><title-group><article-title>&lt;strong&gt;Новый графовый подход к генерации текстов узкой предметной области на естественном языке&lt;/strong&gt;</article-title><trans-title-group xml:lang="en"><trans-title>&lt;strong&gt;A graph-based approach to closed-domain natural language generation&lt;/strong&gt;</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Фирсанова</surname><given-names>Виктория Игоревна</given-names></name><name xml:lang="en"><surname>Firsanova</surname><given-names>Victoria I.</given-names></name></name-alternatives><email>st085687@student.spbu.ru</email><xref ref-type="aff" rid="aff1" /></contrib></contrib-group><aff id="aff1"><institution>Санкт-Петербургский государственный университет, Санкт-Петербург, Россия</institution></aff><pub-date pub-type="epub"><year>2024</year></pub-date><volume>10</volume><issue>3</issue><fpage>0</fpage><lpage>0</lpage><self-uri content-type="pdf" xlink:href="/media/linguistics/2024/3/ВТиПЛ_2024_3_135-167.pdf" /><abstract xml:lang="ru"><p>Обработка естественного языка на основе графов в последние годы становится актуальной благодаря развитию больших языковых моделей и генерации, дополненной информационным поиском. Большие языковые модели &amp;ndash; это сложные алгоритмы, которые распознают многочисленные задачи обработки естественного языка путем анализа инструкций пользователей на естественном языке. Однако их промышленное использование вызывает сомнения из-за таких этических проблем, как создание ложной информации, высокого риска утечки данных и авторских заимствований. В статье представлена новая архитектура для обработки естественного языка, поблочная генерация на основе графов, которая использует самые современные методы глубокого обучения, возможности механизмов внимания, дистрибутивной семантики, информационного поиска на основе графов и децентрализованные сети. Модель кодирует запросы пользователя для снижения риска утечки данных, извлекает релевантную информацию из базы знаний графа и формирует блок для обусловленного моделирования языка с использованием больших языковых моделей. Модель направлена на разрешение ситуации недостатка данных для обучения полноценной модели машинного обучения. Исследование представляет новый набор данных на основе графов. Набор данных задает признаки уязвимых персональных данных для кодирования и текстовую информацию закрытой предметной области для информационного поиска. Он используется для обучения и оценки модели поблочной генерации на основе графов, впервые представленной в данной статье. 
Модель позволяет сократить объем обучающих данных более чем в 100 раз, достигая значения перплексии ~6,51 в задаче генерации естественного языка и F1-меры ~90,3 в задаче извлечения информации, что сопоставимо с большинством современных языковых моделей. Результаты экспериментов доказывают эффективность предлагаемого метода и вносят вклад в разработку алгоритмических подходов к снижению рисков использования больших языковых моделей в промышленности.</p></abstract><trans-abstract xml:lang="en"><p>Graph-based Natural Language Processing (NLP) methods have seen significant advancements in recent years with the development of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). LLMs are sophisticated models that recognize numerous NLP tasks by analyzing the users&amp;#39; natural language instructions called prompts. However, their industrial use is questionable due to such ethical concerns as false information generation known as hallucinations, high risks of data breaches, and plagiarism. The paper introduces a novel NLP architecture, the Graph-Based Block-to-Block Generation (G3BG), which leverages state-of-the-art deep learning techniques, the power of attention mechanisms, distributional semantics, graph-based information retrieval, and decentralized networks. The model encodes user prompts to mitigate data breach risk, retrieves relevant information from a graph knowledge base, and forms a block for a conditional language model using LLMs to perform a new secure type of RAG. The model is closed-domain and oriented toward small-scale settings. It exhibits superior performance across low-resource NLP tasks, which makes it promising for industrial use. The research presents a novel graph-based dataset. The dataset comprises private data features to encode and closed-domain textual information for information retrieval. The dataset is used to train and evaluate the G3BG model. The model allows cutting the training dataset volume by more than 100 times, achieving Perplexity ~6.51 on the Language Generation task and F1-Score ~90.3 on the Information Retrieval task, comparable to most state-of-the-art language models. The experimental results prove the effectiveness of the proposed method and contribute to the algorithmic approaches toward LLM risk mitigation.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>Генерация естественного языка</kwd><kwd>Понимание естественного языка</kwd><kwd>Генеративный искусственный интеллект</kwd><kwd>Большие языковые модели</kwd><kwd>Децентрализованные сети</kwd><kwd>Кодирование данных</kwd><kwd>Дистрибутивная семантика</kwd><kwd>Закрытая предметная область</kwd></kwd-group><kwd-group xml:lang="en"><kwd>Language Generation</kwd><kwd>Language Understanding</kwd><kwd>Generative Artificial Intelligence</kwd><kwd>Large Language Models</kwd><kwd>Decentralized Networks</kwd><kwd>Data Encoding</kwd><kwd>Distributional Semantics</kwd><kwd>Closed-Domain Systems</kwd></kwd-group></article-meta></front><back><ref-list><title>Список литературы</title><ref id="B1"><mixed-citation>Andriushchenko&amp;nbsp;M., Flammarion&amp;nbsp;N. Does Refusal Training in LLMs Generalize to the Past Tense? arXiv preprint arXiv:2407.11969. 2024. P.&amp;nbsp;16. DOI: 10.48550/arXiv.2407.11969</mixed-citation></ref><ref id="B2"><mixed-citation>Anthropic. Claude 3.5 Sonnet Model Card Addendum, 2024.
URL: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf (дата обращения: 06.09.2024).</mixed-citation></ref><ref id="B3"><mixed-citation>Ayyamperumal&amp;nbsp;S.&amp;nbsp;G., Ge&amp;nbsp;L. Current state of LLM Risks and AI Guardrails. arXiv preprint arXiv:2406.12934. 2024. P. 9. DOI: 10.48550/arXiv.2406.12934</mixed-citation></ref><ref id="B4"><mixed-citation>Choi&amp;nbsp;E. Prompt injection: Parameterization of fixed inputs / Choi E., Jo Y., Jang J., Seo M. arXiv preprint arXiv:2206.11349. 2022. DOI: 10.48550/arXiv.2206.11349</mixed-citation></ref><ref id="B5"><mixed-citation>Christiano&amp;nbsp;P.&amp;nbsp;F. Deep reinforcement learning from human preferences / P.&amp;nbsp;F.&amp;nbsp;Christiano, J.&amp;nbsp;Leike, T.&amp;nbsp;B. Brown, M.&amp;nbsp;Martic, S.&amp;nbsp;Legg, D.&amp;nbsp;Amodei // Advances in neural information processing systems. 2017. V.&amp;nbsp;30. Pp.&amp;nbsp;1&amp;ndash;9. DOI: 10.5555/3294996.3295184</mixed-citation></ref><ref id="B6"><mixed-citation>Dettmers&amp;nbsp;T. QLoRA: Efficient finetuning of quantized LLMs / Dettmers&amp;nbsp;T., Pagnoni&amp;nbsp;A., Holtzman&amp;nbsp;A., Zettlemoyer&amp;nbsp;L. // Advances in Neural Information Processing Systems. 2024. V.&amp;nbsp;36. Pp.&amp;nbsp;1&amp;ndash;28. DOI: 10.48550/arXiv.2305.14314</mixed-citation></ref><ref id="B7"><mixed-citation>Devlin&amp;nbsp;J. BERT: Pre-training of deep bidirectional transformers for language understanding / Devlin&amp;nbsp;J., Chang&amp;nbsp;M.&amp;nbsp;W., Lee&amp;nbsp;K., Toutanova&amp;nbsp;K. // Proceedings of NAACL-HLT. 2019. Pp.&amp;nbsp;4171&amp;ndash;4186. DOI: 10.48550/arXiv.1810.04805</mixed-citation></ref><ref id="B8"><mixed-citation>Dong Y. Building Guardrails for Large Language Models / Dong Y., Mu R., Jin G., Qi Y., Hu J., Zhao X., Meng J., Ruan W., Huang X. arXiv preprint arXiv:2402.01822. 2024. DOI: 10.48550/arXiv.2402.01822</mixed-citation></ref><ref id="B9"><mixed-citation>Firsanova&amp;nbsp;V. Towards building a mobile app for people on the spectrum // Companion Proceedings of the ACM Web Conference 2023. 2023. Pp.&amp;nbsp;555&amp;ndash;559. DOI: 10.1145/3543873.3587533</mixed-citation></ref><ref id="B10"><mixed-citation>Firsanova&amp;nbsp;V. The advantages of human evaluation of sociomedical question answering systems // International Journal of Open Information Technologies. 2021. V.&amp;nbsp;9. №&amp;nbsp;12. Pp.&amp;nbsp;53&amp;ndash;59. DOI: 10.25559/INJOIT.2307-8162.09.202112.53-59</mixed-citation></ref><ref id="B11"><mixed-citation>Gage&amp;nbsp;P. A new algorithm for data compression // The C Users Journal. 1994. V.&amp;nbsp;12. №&amp;nbsp;2. Pp.&amp;nbsp;23&amp;ndash;38.</mixed-citation></ref><ref id="B12"><mixed-citation>Gao&amp;nbsp;J., Galley&amp;nbsp;M., Li&amp;nbsp;L. Neural approaches to conversational AI // The 41st international ACM SIGIR conference on research &amp;amp; development in information retrieval. 2018. Pp.&amp;nbsp;1371&amp;ndash;1374. DOI: 10.1145/3209978.3210183</mixed-citation></ref><ref id="B13"><mixed-citation>Goodfellow&amp;nbsp;I., Bengio&amp;nbsp;Y., Courville&amp;nbsp;A. Deep learning. MIT Press, 2016. P.&amp;nbsp;781.</mixed-citation></ref><ref id="B14"><mixed-citation>Google Cloud. Cloud Computing Services, 2024. URL: https://cloud.google.com/ (дата обращения: 06.09.2024).</mixed-citation></ref><ref id="B15"><mixed-citation>Guu K. REALM: Retrieval augmented language model pre-training / Guu K., Lee K., Tung Z., Pasupat P., Chang M. W.
// International conference on machine learning. 2020. Pp. 3929&amp;ndash;3938.</mixed-citation></ref><ref id="B16"><mixed-citation>Hendrycks D. Measuring massive multitask language understanding / Hendrycks D., Burns C., Basart S., Zou A., Mazeika M., Song D., Steinhardt J. arXiv preprint arXiv:2009.03300. 2020. P. 27. DOI: 10.48550/arXiv.2009.03300</mixed-citation></ref><ref id="B17"><mixed-citation>Hewitt&amp;nbsp;J., Manning&amp;nbsp;C.&amp;nbsp;D. A structural probe for finding syntax in word representations // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. Pp.&amp;nbsp;4129&amp;ndash;4138. DOI: 10.18653/v1/N19-1419</mixed-citation></ref><ref id="B18"><mixed-citation>Hu&amp;nbsp;E.&amp;nbsp;J. LoRA: Low-rank adaptation of large language models / Hu&amp;nbsp;E.&amp;nbsp;J., Shen&amp;nbsp;Y., Wallis&amp;nbsp;P., Allen-Zhu&amp;nbsp;Z., Li&amp;nbsp;Y., Wang&amp;nbsp;S., Wang&amp;nbsp;L., Chen&amp;nbsp;W. arXiv preprint arXiv:2106.09685. 2021. P.&amp;nbsp;26. DOI: 10.48550/arXiv.2106.09685</mixed-citation></ref><ref id="B19"><mixed-citation>Jacob B. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference / Jacob B., Kligys S., Chen B., Zhu M., Tang M., Howard A., Adam H., Kalenichenko D. arXiv preprint arXiv:1712.05877. 2018. P. 14. DOI: 10.48550/arXiv.1712.05877</mixed-citation></ref><ref id="B20"><mixed-citation>Ji Z. Survey of hallucination in natural language generation / Ji Z., Lee N., Frieske R., Yu T., Su D., Xu Y., Ishii E., Bang Y., Chen D., Dai W., Chan H. S., Madotto A., Fung P. // ACM Computing Surveys. 2023. V. 55. № 12. Pp. 1&amp;ndash;38.</mixed-citation></ref><ref id="B21"><mixed-citation>Jiang&amp;nbsp;A.&amp;nbsp;Q. Mistral 7B / Jiang&amp;nbsp;A.&amp;nbsp;Q., Sablayrolles&amp;nbsp;A., Mensch&amp;nbsp;A., Bamford&amp;nbsp;C., Chaplot&amp;nbsp;D.&amp;nbsp;S., Casas&amp;nbsp;D.&amp;nbsp;D., Bressand&amp;nbsp;F., Lengyel&amp;nbsp;G., Lample&amp;nbsp;G., Saulnier&amp;nbsp;L., Lavaud&amp;nbsp;L.&amp;nbsp;R. arXiv preprint arXiv:2310.06825. 2023. P.&amp;nbsp;9. DOI: 10.48550/arXiv.2310.06825</mixed-citation></ref><ref id="B22"><mixed-citation>Jelinek&amp;nbsp;F. Perplexity &amp;ndash; a measure of the difficulty of speech recognition tasks / Jelinek&amp;nbsp;F., Mercer&amp;nbsp;R.&amp;nbsp;L., Bahl&amp;nbsp;L.&amp;nbsp;R., Baker&amp;nbsp;J.&amp;nbsp;K. // The Journal of the Acoustical Society of America. 1977. V.&amp;nbsp;62. №&amp;nbsp;S1. Pp.&amp;nbsp;S63&amp;ndash;S63.</mixed-citation></ref><ref id="B23"><mixed-citation>Jurafsky&amp;nbsp;D., Martin&amp;nbsp;J.&amp;nbsp;H. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Stanford University, University of Colorado at Boulder. 2023. P.&amp;nbsp;577.</mixed-citation></ref><ref id="B24"><mixed-citation>LM Studio. LM Studio Documentation, 2024. URL: https://lmstudio.ai/docs/welcome (дата обращения: 06.09.2024).</mixed-citation></ref><ref id="B25"><mixed-citation>Luo&amp;nbsp;H., Luo&amp;nbsp;J., Vasilakos&amp;nbsp;A.&amp;nbsp;V. BC4LLM: Trusted artificial intelligence when blockchain meets large language model. arXiv preprint arXiv:2310.06278. 2023. P.&amp;nbsp;42. DOI: 10.48550/arXiv.2310.06278</mixed-citation></ref><ref id="B26"><mixed-citation>McCarthy J. Generality in artificial intelligence // Communications of the ACM. 1987. V. 30. № 12. Pp.
1030&amp;ndash;1035.</mixed-citation></ref><ref id="B27"><mixed-citation>Meister C., Cotterell R. Language Model Evaluation Beyond Perplexity // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. Pp. 5328&amp;ndash;5339.</mixed-citation></ref><ref id="B28"><mixed-citation>Mikolov&amp;nbsp;T. Efficient estimation of word representations in vector space / T.&amp;nbsp;Mikolov, Chen&amp;nbsp;K., Corrado&amp;nbsp;G., Dean&amp;nbsp;J. arXiv preprint arXiv:1301.3781. 2013. P.&amp;nbsp;12. DOI: 10.48550/arXiv.1301.3781</mixed-citation></ref><ref id="B29"><mixed-citation>Mistral. Mistral Large 2, 2024. URL: https://mistral.ai/news/mistral-large-2407/ (дата обращения: 06.09.2024).</mixed-citation></ref><ref id="B30"><mixed-citation>Morris J., Hirst G. Lexical cohesion computed by thesaural relations as an indicator of the structure of text // Computational Linguistics. 1991. V. 17. № 1. Pp. 21&amp;ndash;48.</mixed-citation></ref><ref id="B31"><mixed-citation>Ouyang&amp;nbsp;L. Training language models to follow instructions with human feedback / Ouyang&amp;nbsp;L., Wu&amp;nbsp;J., Jiang&amp;nbsp;X., Almeida&amp;nbsp;D., Wainwright&amp;nbsp;C., Mishkin&amp;nbsp;P., Zhang&amp;nbsp;C., Agarwal&amp;nbsp;S., Slama&amp;nbsp;K., Ray&amp;nbsp;A., Schulman&amp;nbsp;J. // Advances in Neural Information Processing Systems. 2022. V.&amp;nbsp;35. Pp.&amp;nbsp;27730&amp;ndash;27744. DOI: 10.48550/arXiv.2203.02155</mixed-citation></ref><ref id="B32"><mixed-citation>OpenAI. GPT-4o mini: advancing cost-efficient intelligence, 2024. URL: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (дата обращения: 06.09.2024).</mixed-citation></ref><ref id="B33"><mixed-citation>OpenAI. OpenAI API, 2024. URL: https://openai.com/index/openai-api/ (дата обращения: 06.09.2024).</mixed-citation></ref><ref id="B34"><mixed-citation>Polyzotis&amp;nbsp;N., Zaharia&amp;nbsp;M. What can data-centric AI learn from data and ML engineering? arXiv preprint arXiv:2112.06439. 2021. P.&amp;nbsp;5.</mixed-citation></ref><ref id="B35"><mixed-citation>Priest G. Logic: A Very Short Introduction. Oxford University Press. 2000. P. 160.</mixed-citation></ref><ref id="B36"><mixed-citation>Raffel&amp;nbsp;C. Exploring the limits of transfer learning with a unified text-to-text transformer / Raffel&amp;nbsp;C., Shazeer&amp;nbsp;N., Roberts&amp;nbsp;A., Lee&amp;nbsp;K., Narang&amp;nbsp;S., Matena&amp;nbsp;M., Zhou&amp;nbsp;Y., Li&amp;nbsp;W., Liu&amp;nbsp;P.&amp;nbsp;J. // Journal of machine learning research. 2020. V.&amp;nbsp;21. №&amp;nbsp;140. Pp.&amp;nbsp;1&amp;ndash;67.</mixed-citation></ref><ref id="B37"><mixed-citation>Rajpurkar&amp;nbsp;P. SQuAD: 100,000+ questions for machine comprehension of text / Rajpurkar&amp;nbsp;P., Zhang&amp;nbsp;J., Lopyrev&amp;nbsp;K., Liang&amp;nbsp;P. arXiv preprint arXiv:1606.05250. 2016. P.&amp;nbsp;10. DOI: 10.48550/arXiv.1606.05250</mixed-citation></ref><ref id="B38"><mixed-citation>Rajpurkar&amp;nbsp;P., Jia&amp;nbsp;R., Liang&amp;nbsp;P. Know what you don&amp;#39;t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822. 2018. P.&amp;nbsp;9. DOI: 10.48550/arXiv.1806.03822</mixed-citation></ref><ref id="B39"><mixed-citation>Ruder&amp;nbsp;S. Neural transfer learning for natural language processing. NUI Galway. 2019. P. 330.</mixed-citation></ref><ref id="B40"><mixed-citation>Schmidhuber&amp;nbsp;J.
Evolutionary principles in self-referential learning, or on learning how to learn. Technische Universit&amp;auml;t M&amp;uuml;nchen. 1987. P.&amp;nbsp;64.</mixed-citation></ref><ref id="B41"><mixed-citation>Talmor&amp;nbsp;A. CommonsenseQA: A question answering challenge targeting commonsense knowledge / Talmor&amp;nbsp;A., Herzig&amp;nbsp;J., Lourie&amp;nbsp;N., Berant&amp;nbsp;J. arXiv preprint arXiv:1811.00937. 2018. P.&amp;nbsp;10. DOI: 10.48550/arXiv.1811.00937</mixed-citation></ref><ref id="B42"><mixed-citation>Thakur N. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models / Thakur N., Reimers N., R&amp;uuml;ckl&amp;eacute; A., Srivastava A., Gurevych I. arXiv preprint arXiv:2104.08663. 2021. P. 24. DOI: 10.48550/arXiv.2104.08663</mixed-citation></ref><ref id="B43"><mixed-citation>Van&amp;nbsp;Rijsbergen&amp;nbsp;C.&amp;nbsp;J. Information Retrieval. London: Butterworths. 1979. P. 147.</mixed-citation></ref><ref id="B44"><mixed-citation>Vaswani&amp;nbsp;A. Attention is all you need / Vaswani&amp;nbsp;A., Shazeer&amp;nbsp;N., Parmar&amp;nbsp;N., Uszkoreit&amp;nbsp;J., Jones&amp;nbsp;L., Gomez&amp;nbsp;A.&amp;nbsp;N. // Advances in neural information processing systems. 2017. V.&amp;nbsp;30. Pp.&amp;nbsp;261&amp;ndash;272. DOI: 10.48550/arXiv.1706.03762</mixed-citation></ref><ref id="B45"><mixed-citation>Wolf&amp;nbsp;T. HuggingFace&amp;#39;s Transformers: State-of-the-art natural language processing / Wolf&amp;nbsp;T., Debut&amp;nbsp;L., Sanh&amp;nbsp;V., Chaumond&amp;nbsp;J., Delangue&amp;nbsp;C., Moi&amp;nbsp;A., Cistac&amp;nbsp;P., Rault&amp;nbsp;T., Louf&amp;nbsp;R., Funtowicz&amp;nbsp;M., Davison&amp;nbsp;J. arXiv preprint arXiv:1910.03771. 2019. P.&amp;nbsp;8. DOI: 10.48550/arXiv.1910.03771</mixed-citation></ref><ref id="B46"><mixed-citation>Zhang&amp;nbsp;P. Retrieve anything to augment large language models / Zhang&amp;nbsp;P., Xiao&amp;nbsp;S., Liu&amp;nbsp;Z., Dou&amp;nbsp;Z., Nie&amp;nbsp;J. Y. arXiv preprint arXiv:2310.07554. 2023. P.&amp;nbsp;16. DOI: 10.48550/arXiv.2310.07554</mixed-citation></ref><ref id="B47"><mixed-citation>Zhong&amp;nbsp;W. AGIEval: A human-centric benchmark for evaluating foundation models / Zhong&amp;nbsp;W., Cui&amp;nbsp;R., Guo&amp;nbsp;Y., Liang&amp;nbsp;Y., Lu&amp;nbsp;S., Wang&amp;nbsp;Y., Saied&amp;nbsp;A., Chen&amp;nbsp;W., Duan&amp;nbsp;N. arXiv preprint arXiv:2304.06364. 2023. P.&amp;nbsp;22. DOI: 10.48550/arXiv.2304.06364</mixed-citation></ref></ref-list></back></article>