{"id":134543,"date":"2024-06-04T10:00:00","date_gmt":"2024-06-04T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/security\/blog\/?p=134543"},"modified":"2024-06-25T16:16:37","modified_gmt":"2024-06-25T23:16:37","slug":"ai-jailbreaks-what-they-are-and-how-they-can-be-mitigated","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/security\/blog\/2024\/06\/04\/ai-jailbreaks-what-they-are-and-how-they-can-be-mitigated\/","title":{"rendered":"AI jailbreaks: What they are and how they can be mitigated"},"content":{"rendered":"\n

Generative AI systems are made up of multiple components that interact to provide a rich user experience between the human and the AI model(s). As part of a responsible AI approach<\/a>, AI models are protected by layers of defense mechanisms to prevent the production of harmful content or being used to carry out instructions that go against the intended purpose of the AI integrated application. This blog will provide an understanding of what AI jailbreaks are, why generative AI is susceptible to them, and how you can mitigate the risks and harms.<\/p>\n\n\n\n

What is AI jailbreak?<\/h2>\n\n\n\n

An AI jailbreak is a technique<\/em> that can cause the failure of guardrails (mitigations<\/em>). The resulting harm<\/em> comes from whatever guardrail was circumvented: for example, causing the system to violate its operators\u2019 policies, make decisions unduly influenced by one user, or execute malicious instructions. This technique<\/em> may be associated with additional attack techniques<\/em> such as prompt injection, evasion, and model manipulation. You can learn more about AI jailbreak techniques in our AI red team\u2019s Microsoft Build session, How Microsoft Approaches AI Red Teaming<\/a>.<\/p>\n\n\n

\"Diagram
Figure 1. AI safety finding ontology <\/em><\/figcaption><\/figure>\n\n\n\n

Here is an example of an attempt to ask an AI assistant to provide information about how to build a Molotov cocktail (firebomb). We know this knowledge is built into most of the generative AI models available today, but is prevented from being provided to the user through filters and other techniques to deny this request. Using a technique like Crescendo<\/a>, however, the AI assistant can produce the harmful content that should otherwise have been avoided. This particular problem has since been addressed in Microsoft’s safety filters; however, AI models are still susceptible to it. Many variations of these attempts are discovered on a regular basis, then tested and mitigated.<\/p>\n\n\n\n

\"Animated
Figure 2. Crescendo attack to build a Molotov cocktail <\/em><\/figcaption><\/figure>\n\n\n\n

Why is generative AI susceptible to this issue?<\/strong><\/h2>\n\n\n\n

When integrating AI into your applications, consider the characteristics of AI and how they might impact the results and decisions made by this technology. Without anthropomorphizing AI, the interactions are very similar to the issues you might find when dealing with people. You can consider the attributes of an AI language model to be similar to an eager but inexperienced employee trying to help your other employees with their productivity:<\/p>\n\n\n\n

    \n
  1. Over-confident: They may confidently present ideas or solutions that sound impressive but are not grounded in reality, like an overenthusiastic rookie who hasn’t learned to distinguish between fiction and fact.<\/li>\n\n\n\n
  2. Gullible: They can be easily influenced by how tasks are assigned or how questions are asked, much like a na\u00efve employee who takes instructions too literally or is swayed by the suggestions of others.<\/li>\n\n\n\n
  3. Wants to impress: While they generally follow company policies, they can be persuaded to bend the rules or bypass safeguards when pressured or manipulated, like an employee who may cut corners when tempted.<\/li>\n\n\n\n
  4. Lack of real-world application: Despite their extensive knowledge, they may struggle to apply it effectively in real-world situations, like a new hire who has studied the theory but may lack practical experience and common sense.<\/li>\n<\/ol>\n\n\n\n

    In essence, AI language models can be likened to employees who are enthusiastic and knowledgeable but lack the judgment, context understanding, and adherence to boundaries that come with experience and maturity in a business setting.<\/p>\n\n\n\n

    So we can say that generative AI models and system have the following characteristics:<\/p>\n\n\n\n