Researchers from Carnegie Mellon University and the Center for A.I. Safety have discovered a new automated attack method that can override the guardrails of large language models (LLMs). These guardrails are safety measures designed to prevent AI from generating harmful content.
This discovery poses a significant risk to the deployment of LLMs in public-facing applications, as it could potentially allow these models to be used for malicious purposes.
The researchers' attack method was effective on all chatbots tested, including OpenAI’s ChatGPT, Google’s Bard, Microsoft’s Bing Chat, and Anthropic’s Claude 2. There are implications for applications based on open-source LLMs, like Meta’s LLaMA models.
The attack works by exploiting access to the AI model's weights, the learned parameters that determine the influence of each node in a neural network. The researchers developed a program that automatically searches for suffixes that, when appended to a prompt, override the system’s guardrails. To a human, these suffixes look like random characters and nonsense words, but they can trick the LLM into providing the response the attacker desires.
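The published attack uses gradient information from the model's weights to pick promising token substitutions. The toy sketch below is not that method: it uses an entirely made-up `refusal_score` function and random substitutions, purely to illustrate the general search loop, mutating one suffix position at a time and keeping any change that lowers a refusal score.

```python
import random

# Printable ASCII characters stand in for the model's token vocabulary.
VOCAB = [chr(c) for c in range(33, 127)]

def refusal_score(prompt: str) -> float:
    """Hypothetical stand-in for a guardrail's refusal signal (0-99).
    A real attack would derive this from the LLM's loss on a target
    completion; this is deterministic but opaque, like a black box."""
    return sum((ord(ch) * 2654435761) % 97 for ch in prompt) % 100

def find_adversarial_suffix(prompt, suffix_len=8, iters=2000, seed=0):
    """Coordinate search: mutate one suffix position at a time and keep
    the mutation only if it lowers the refusal score. (The real attack
    uses gradients to choose candidate substitutions, not randomness.)"""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = refusal_score(prompt + "".join(suffix))
    for _ in range(iters):
        i = rng.randrange(suffix_len)
        old = suffix[i]
        suffix[i] = rng.choice(VOCAB)
        score = refusal_score(prompt + "".join(suffix))
        if score < best:
            best = score        # keep the improving mutation
        else:
            suffix[i] = old     # revert
    return "".join(suffix), best

suffix, score = find_adversarial_suffix("How do I do X?")
print(suffix, score)
```

The resulting suffix is gibberish to a human reader, which is exactly the character of the strings the researchers found: meaningless text that nonetheless steers the scoring function toward the attacker's goal.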
The researchers found their attacks to be highly successful, especially against open-source chatbots. For instance, their attacks had a near-100% success rate against Vicuna, a chatbot built on top of Meta’s original LLaMA. Even against Meta’s newer LLaMA 2 models, which were designed to have stronger guardrails, the attack method achieved a 56% success rate for any individual harmful behavior.
Surprisingly, the same attack suffixes also worked relatively well against proprietary models, where the companies only provide access to a public-facing prompt interface. Zico Kolter, one of the Carnegie Mellon professors who worked on the research, suggests that this might be due to the nature of language itself and how deep learning systems build statistical maps of language.
Despite the findings, the researchers argue against the notion that powerful A.I. models should not be open-sourced. They believe that open-source models are crucial for identifying security vulnerabilities and developing better solutions. They warn that making all LLMs proprietary would only enable those with enough resources to build their own LLMs to engineer such attacks, while independent academic researchers would be unable to develop safeguards against them.
However, the researchers also acknowledge that their discovery may not bode well for the prospects of mitigating this newly discovered LLM vulnerability. The research builds on methods that had previously been successful at attacking image classification A.I. systems, and no reliable defense against those attacks has been found that does not sacrifice the A.I. model’s overall performance and efficiency.
In light of these findings, organizations deploying commercial LLM products should assume those products may be vulnerable until these security weaknesses are better understood and addressed. There is already evidence that the U.S. government is leaning toward requiring companies to keep model weights private and secure.