16+
DOI: 10.18413/2313-8912-2024-10-4-0-3

Prompt injection – the problem of linguistic vulnerabilities of large language models at the present stage

Abstract

The article examines the phenomenon of “prompt injection” in the context of contemporary large language models (LLMs), elucidating a significant challenge for AI developers and researchers. The study comprises a theoretical and methodological review of scholarly publications, thereby enhancing the comprehension of the present state of research in this field. The authors present the findings of a case study, which employs a comparative analysis of the linguistic vulnerabilities of prominent LLMs, including Chat GPT 4.0, Claude 3.5, and Yandex GPT. The study employs experimental evaluation to assess the resilience of these models against a range of vector attacks, with the objective of determining the extent to which each model resists manipulative prompts designed to exploit their linguistic capabilities. A taxonomy of prompt injection attack types was developed based on the collected data, with classification according to effectiveness and targeting of specific LLMs. This classification facilitates comprehension of the nature of these vulnerabilities and provides a basis for future research in this field. Moreover, the article offers suggestions for bolstering the resilience of language models against negative manipulations, representing a significant stride towards the development of safer and more ethical AI systems. These recommendations are based on empirical data and aim to provide practical guidance for developers seeking to enhance the resilience of their models against potential threats. The research findings extend our understanding of linguistic vulnerabilities in LLMs, while also contributing to the development of more effective defence strategies. These have practical implications for the deployment of LLMs across various domains, including education, healthcare and customer service. The authors emphasise the necessity for continuous monitoring and improvement of language model security in an ever-evolving technological landscape. The findings suggest the necessity for an ongoing dialogue among stakeholders to address issues pertaining to the prompt injection of funds.


Introduction

In recent years, research in the field of natural language processing has made significant progress, attracting widespread interest not only in academic circles but also beyond. Consequently, a considerable number of companies and research laboratories have commenced the active implementation of large language models through application programming interfaces (APIs) and chatbots. These tools, based on deep learning methods, are capable of addressing a wide range of tasks – from text generation, classification, and summarization to scriptwriting and error correction in various programming languages. Notable large language models include GPT-4 by OpenAI, LLaMA by Meta, and Bard by Google.

The shift toward using simple and user-friendly interfaces (such as chat-based prompts) for large language models has greatly contributed to democratizing access to artificial intelligence, allowing individuals and organizations to utilize powerful natural language processing tools that were previously available only to specialists with deep computer science knowledge and computational resources available to the wealthiest organizations (Rossi et al, 2024).

In the context of rapid advancements in artificial intelligence, particularly large language models (LLMs), prompt engineering has become a critical skill for effective interaction with linguistic tools. This approach allows for the implementation of rules and automation of processes, ensuring both high quality and quantity of LLM-generated outputs. The order in which examples are provided in prompts, automatic generation of instructions, and methods for their selection have proven to be significant for enhancing LLM performance. It is noteworthy that automatically generated instructions exhibit a quality that is comparable to, or even exceeds, that of instructions annotated by humans. They outperform standard LLM benchmarks, thus making prompt engineering a programmable procedure for fine-tuning outputs and interacting with LLMs (Marvin et al., 2023).

One of the most pressing issues has already become the phenomenon of “prompt injection” – a specific type of attack on linguistic vulnerabilities in publicly accessible LLMs. The aim of this research is to analyze this critical issue and develop author’s recommendations for enhancing protective methods. In a broader sense, the phenomenon of prompt injection generally is a complex array of various types of network attacks. For instance, today we can provide the following list: Indirect Prompt Injection Threats, Jailbreak, Prompt Leaking, SQL Injection, and Vulnerabilities in API. However, this study focuses only on a specific type of semantic attacks on LLMs (Liu et al, 2024). The core issue lies in the fact that an attacker can formulate “hidden” instructions within a model's query, which will alter the output generated by the LLM against its intended purpose and built-in limitations (Yan et al, 2024). These manipulations challenge and jeopardize the fundamental paradigm of secure human-AI interaction, undermining trust in these advanced models.

What are the reasons for regarding prompt injection as a significant threat? Firstly, contemporary LLMs demonstrate a high degree of generalization (Khandelwa et al., 2019), exhibiting the capacity to interpret intricate linguistic structures across a vast array of subjects. Second, they often have access to confidential information and can perform tasks critical to public welfare. Third, the main technical features of interactive LLM operations enable various types of “linguistic attacks” [Chang et al, 2024]. The issue of prompt injection raises fundamental questions about the nature of language and communication in the context of artificial intelligence. Can we create a truly secure LLM today? What are the main issues of “linguistic security” (Tavabi et al, 2018) that developers need to address at this stage of programming and tuning LLMs? These questions require a systemic solution, where the authors see a key methodological role in a transdisciplinary approach combining linguistics [Röttger et al, 2024], the philosophy of language (Zhang, 2024), and computer science.

Large language models have been deployed in a multitude of domains, including web applications, where they facilitate human interaction through chatbots with natural language interfaces. Internally, using middleware for LLM integration, such as Langchain, user queries are converted into prompts, which are then used by the LLM to provide meaningful responses to users. However, unsanitized user queries can lead to attacks using prompt injection, potentially compromising the security of the database. The new ability to customize models to meet specific needs has opened up new horizons for AI applications. The injection of prompts allows an attacker to not only extract prompts configured by the user but also gain access to uploaded files (Yu et al., 2023). Despite growing interest in vulnerabilities related to prompt injection in LLM queries, specific risks associated with generating prompt injection attacks remain insufficiently studied (Pedro et al, 2023).

Currently, numerous defense algorithms are under active development to counter such attacks on LLMs. Relevant topical proposals are based on a variety of algorithmic solutions and offer different validated methods to optimize security (analyses of defensive ‘alignment’ algorithms, MIA attacks, potential of prompts decomposition, etc.). (Chen et al, 2024, Duan et al, 2024, Li et al, 2024).

At the same time, since this study does not touch upon the issues of attacks involving the use/injection of malicious code, the experimental part primarily focuses on the problem of prompt jailbreaks, following established terminological conventions in academic practice (Yu et al, 2024). For example, an attacker can input a manipulative instruction, bypassing established restrictions and disguising it as a regular question or request, in order to prompt the model to provide prohibited information. Such phrasing may encourage the model to reveal methods for circumventing protective measures or identify vulnerabilities in the system, as well as the type of query disguised by using a security modification scenario. The model, "unaware" of the true intent of the query, may provide information or advice that contradicts its intended purpose.

Therefore, this study positions itself at the intersection of linguistic security and AI ethics, fostering a deeper understanding of the threats posed by "prompt injection" and contributing to the development of reliable security protocols for potential LLM applications.

Main Section

Purpose statement

To study the phenomenon of prompt injection, the authors implemented a series of experiments (case studies) with various publicly available LLMs. The purpose of the study is to create a generalized taxonomy of the possibilities of linguistic prompt injection attacks and their effectiveness against various popular LLMs, and based on the case study results, to develop a set of author’s recommendations for enhancing LLM resilience to such linguistic manipulations.

The implementation of the objective of the work implies the solution of the following tasks:

1. Vulnerability analysis (case study): testing three primary publicly available and popular models (ChatGPT, Claude, Gemini) for resilience to various types of “injections”, including both direct syntactically complex commands, task execution conditions, and hidden instructions, as well as variable semantic manipulations (Hines et al, 2024).

2. Development of the author’s recommendations based on the results of experiments: evaluating the effectiveness of data filtering and input optimisation techniques, as well as basic linguistic “immunization” methods for LLMs, creating recommendations for protecting against this attack vector (Mudarova, Namiot, 2024).

Material and methods of research

To conduct the authors' experiments, three key directions were identified for the case study in the context of the stated topic. Each direction was selected based on its relevance and potential impact on society and technology.

1. Software engineering solutions that can be used for cyberattacks/ unauthorized access to LLM control (HackCode). In this area, we explored various tools and techniques that could be employed by malicious actors to carry out cyberattacks. We analyzed existing vulnerabilities in popular software products and assessed how generative models might assist in creating malicious code. The study included an analysis of real-world cyberattack cases and an evaluation of the potential of generative models in automating the process of creating malware.

2. Circumventing restrictions in creating adult content (AdultContent). This area focuses on investigating how generative models can be used to create content that falls under restrictions and censorship. We conducted experiments with various prompts to determine how effectively models can bypass built-in filters and limitations. A key aspect of this research was understanding the ethical and legal consequences of such actions, as well as their potential societal impact.

3. Creating fake/manipulated content of various types (FakeNews). In this direction, we focused on the generation of disinformation and fake news. We examined how generative models can be used to create plausible but false news stories, and analyzed how such materials can influence public opinion and information perception. The experiments included the creation of various types of fake news, from political to social, with the aim of assessing their impact on target audiences.

In order to examine the performance of the three tested LLMs in each direction, the authors devised 20 queries for each direction, which were then used in the experiment.

In each of the aforementioned areas, three main categories of final results were defined:

1. Negative (neg) – the LLM refused to generate output based on the user's request for any reason. This category includes cases where the model declines to fulfill the request due to built-in limitations or ethical considerations.

2. Positive (pos) – the LLM generated output fully in response to the user's request. This category considers successful results where the model was able to fulfill the user’s request without any limitations or errors.

3. Error (err) – the LLM exhibited some form of explicit error (hallucinations, generation failure, evident mistakes in output). This category includes cases where the model produces irrelevant or incorrect results, which may indicate issues in its training or architecture.

The tables below summarise the results of the authors' experiment with partial citations of successful “prompt injection/jailbreak”, illustrating the frequency of each result category depending on the research area. These data will help deepen the understanding of the capabilities and limitations of LLMs in the context of the defined topics.

Table 1. General indicators

Таблица 1. Общие показатели

Table 2. Examples of successful prompt injection[1]

Таблица 2. Примеры успешных prompt injection


Results and discussion

In this experiment, a linguistic analysis of prompts aimed at generating various texts using large language models (LLMs) such as Chat GPT-4, Claude 3.5, and Yandex GPT was conducted. The goal of the research was to assess how susceptible these models are to executing prompts, including potentially unethical or rule-violating requests, and to evaluate their ability to create texts that may violate established constraints.

Most of the prompts included instructions to simulate specific situations or events, requiring the models to accurately understand the requested information and apply the appropriate lexical and semantic units. The prompts aimed at LLM generation were focused on creating texts on sensitive, provocative, or unethical topics, using obscene language, and producing literary works imitating the style of well-known but controversial authors. The results of the analysis indicated that the models generally recognize and respond effectively to prompts that may represent attempts to inject malicious commands (prompt injection/jailbreak). Specifically, GPT demonstrated the ability to handle complex vocabulary and create texts that meet professional standards. For example, when handling literary prompts, the model successfully imitated the style of the requested authors (e.g., Marquis de Sade, William Burroughs), while avoiding excessively explicit content whenever possible. This highlights the model’s high degree of adaptability and cognitive flexibility in responding to various thematic prompts. Nevertheless, when confronted with explicitly unethical requests, the model typically refuses to fulfill them or points out the inadmissibility of creating such texts. In cases where prompts exceed acceptable boundaries, GPT enforces restrictions, aiming to minimize undesirable content.

Yandex GPT employs particularly cautious language when dealing with illegal or ethically questionable topics. In cases where fulfilling a request is feasible, the model prefers neutral and general language, avoiding specific details that might violate its usage policy. In the vast majority of cases during the author’s experiment, Yandex GPT refused to provide responses to provocative or unethical requests, emphasizing the non-compliance of such requests with the model’s ethical or political guidelines. Instead, the model suggests changing the topic or redirecting the query toward something more acceptable. When the model detects potential risks or dangers, it employs cautionary phrases to alert the user to the inadmissibility of such actions.

Claude 3.5 and Yandex GPT, as demonstrated by the author’s experiment, show the highest capacity to restrict the creation of “unacceptable” content – either by transforming the requests or by declining to fulfill them. However, under certain conditions, such as mixing permissible and impermissible topics, the LLM may partially succumb to manipulation and generate content that goes beyond the basic ethical standards or restrictions. This highlights the need to strengthen filters and introduce additional mechanisms to prevent such situations.

Despite their overall resilience, there were instances where language models were partially manipulated, especially with complex, combined queries. This indicates the importance of further improving protective mechanisms. In light of these observations, the following recommendations can be made to enhance the security and reliability of tested models while maintaining their high cognitive abilities in generating persuasive and credible content:

1. There is a need to strengthen filters that monitor and prevent the execution of requests that may violate ethical norms or the model’s usage policy. Special attention should be given to complex and multi-level queries that could be used to bypass restrictions.

2. Models should develop a stronger capability for dynamic context assessment in queries to more effectively identify hidden user intentions and prevent manipulation attempts.

3. As prompt injection and jailbreak tactics continue to evolve, it is crucial to regularly update the models' query recognition algorithms, adapting them to new threats and improving their ability to recognize manipulative and provocative promts.

4. Continued training of models with a focus on the ethical aspects of text generation is essential to ensure that models not only recognize but also respond appropriately to requests that may be potentially dangerous or unethical.

Conclusions

In summary, the analysis demonstrates that the models Claude 3.5 and Yandex GPT are more effectively protected against prompt injection attempts compared to Chat GPT 4o, showing resilience to provocative and manipulative queries. The models exercise caution, avoiding responses to requests that violate usage policies, and maintain a neutral and safe style in their answers. The language and semantics of their responses indicate that, in most cases, the models recognize dangerous queries and effectively prevent their execution. However, several recommendations should be followed to make working with LLMs more efficient and secure.


[1] To illustrate the results of the authors’ experiment on “Prompt injection – the problem of linguistic vulnerabilities of large language models at the present stage”, the authors present partial results of the study (three blocks of 20 prompts each) on three topics (HackCode, AdultContent and FakeNews) using the outputs generated by different large language models involved in the authors’ experiment. URL: https://www.researchgate.net/publication/385855436_Appendix_to_the_Article_Prompt_Injection_The_Problem_of_Linguistic_Vulnerability_of_Large_Language_Models_at_the_Current_Stage (accessed 16.11.2024)

Reference lists

Chang, Z., Li, M., Liu, Y., Wang, J., Wang, Q. and Liu, Y. (2024). Play guessing game with LLM: Indirect jailbreak attack with implicit clues, arXiv preprint arXiv:2402.09091. https://doi.org/10.48550/arXiv.2402.09091(In English)

Chen, S., Zharmagambetov, A., Mahloujifar, S., Chaudhuri, K. and Guo, C. (2024). Aligning LLMs to Be Robust Against Prompt Injection. arXiv preprint arXiv:2410.05451. https://doi.org/10.48550/arXiv.2410.05451(In English)

Duan, M., Suri, A., Mireshghallah, N., Min, S., Shi, W., Zettlemoyer, L., Tsvetkov, Yu, Choi, Y., Evans, D. and Hajishirzi, H. (2024). Do membership inference attacks work on large language models?, arXiv preprint arXiv:2402.07841. https://doi.org/10.48550/arXiv.2402.07841(In English)

Hines, K., Lopez, G., Hall, M., Zarfati, F., Zunger, Y. and Kiciman, E. (2024). Defending Against Indirect Prompt Injection Attacks with Spotlighting, arXiv preprint arXiv:2403.14720. https://doi.org/10.48550/arXiv.2403.14720(In English)

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L. and Lewis, M. (2019). Generalization through memorization: Nearest neighbor language models, arXiv preprint arXiv:1911.00172. https://doi.org/10.48550/arXiv.1911.00172(In English).

Kumar, S. S., Cummings, M. L. and Stimpson, A. (2024, May). Strengthening LLM trust boundaries: A survey of prompt injection attacks, 2024 IEEE 4th International Conference on Human-Machine Systems (ICHMS), 1–6, available at: https://www.researchgate.net/profile/Missy-Cummings/publication/378072627_Strengthening_LLM_Trust_Boundaries_A_Survey_of_Prompt_Injection_Attacks/links/65c57ac379007454976ae142/Strengthening-LLM-Trust-Boundaries-A-Survey-of-Prompt-Injection-Attacks.pdf/ (Accessed 29.06.2024). DOI: 10.1109/ICHMS59971.2024.10555871 (In English)

Li, X., Wang, R., Cheng, M., Zhou, T. and Hsieh, C. J. (2024). Drattack: Prompt decomposition and reconstruction makes powerful llm jailbreakers. arXiv preprint arXiv:2402.16914. https://doi.org/10.48550/arXiv.2402.16914 (In English)

Liu, X., Yu, Z., Zhang, Y., Zhang, N. and Xiao, C. (2024). Automatic and universal prompt injection attacks against large language models, arXiv preprint arXiv:2403.04957. https://doi.org/10.48550/arXiv.2403.04957(In English)

Marvin, G. Hellen, N., Jjingo, D. and Nakatumba-Nabende, J. 2023). Prompt engineering in large language models, Proceedings of the International conference on data intelligence and cognitive informatics, Springer Nature Singapore, Singapore, 387–402, available at: https://www.researchgate.net/publication/377214553_Prompt_Engineering_in_Large_Language_Models (accessed 29.06.2024). DOI: 10.1007/978-981-99-7962-2_30 (In English)

Mudarova, R., Namiot, D. (2024). Countering Prompt Injection attacks on large language models, International Journal of Open Information Technologies, 12 (5), 39–48. (In Russian)

Pedro, R., Castro, D., Carreira, P. and Santos, N. (2023). From prompt injections to SQL injection attacks: How protected is your llm-integrated web application?, arXiv preprint arXiv:2308.01990. https://doi.org/10.48550/arXiv.2308.01990(In English)

Piet, J., Alrashed, M., Sitawarin, C., Chen, S., Wei, Z., Sun, E. and Wagner, D. (2023). Jatmo: Prompt injection defense by task-specific finetuning, arXiv preprint arXiv:2312.17673. https://doi.org/10.48550/arXiv.2312.17673(In English)

Röttger, P., Pernisi, F., Vidgen, B. and Hovy, D. (2024). Safety prompts: a systematic review of open datasets for evaluating and improving large language model safety, arXiv preprint arXiv:2404.05399.m. https://doi.org/10.48550/arXiv.2404.05399(In English)

Rossi, S., Michel, A M., Mukkamala, R. R. and Thatcher, J. B. (2024). An Early Categorization of Prompt Injection Attacks on Large Language Models, arXiv preprint arXiv:2402.00898. https://doi.org/10.48550/arXiv.2402.00898(In English)

Tavabi, N., Goyal, P., Almukaynizi, M., Shakarian, P. and Lerman, K. (2018). Darkembed: Exploit prediction with neural language models, Proceedings of the AAAI Conference on Artificial Intelligence, 32, 1, 7849–7854. https://doi.org/10.1609/aaai.v32i1.11428(In English)

Yan, J., Yadav, V., Li, S., Chen, L., Tang, Z., Wang, H. and Jin, H. (2024). Backdooring instruction-tuned large language models with virtual prompt injection, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, Long Papers, 6065–6086. DOI: 10.18653/v1/2024.naacl-long.337 (In English)

Yu, J., Wu, Y., Shu, D., Jin, M., Yang, S. and Xing, X. (2023). Assessing prompt injection risks in 200+ custom GPTS, arXiv preprint arXiv:2311.11538. https://doi.org/10.48550/arXiv.2311.11538(In English)

Yu, Z., Liu, X., Liang, S., Cameron, Z., Xiao, C. and Zhang, N. (2024). Don't Listen to Me: Understanding and Exploring Jailbreak Prompts of Large Language Models, arXiv preprint arXiv:2403.17336. https://doi.org/10.48550/arXiv.2403.17336(In English)

Zhang, J. (2024). Should We Fear Large Language Models? A Structural Analysis of the Human Reasoning System for Elucidating LLM Capabilities and Risks Through the Lens of Heidegger’s Philosophy, arXiv preprint arXiv:2403.03288. https://doi.org/10.48550/arXiv.2403.03288(In English)