How to Jailbreak Machine Learning With Machine Learning
Researchers Automate Tricking LLMs Into Providing Harmful Information
A small group of researchers says it has identified an automated method, with no obvious fix, for jailbreaking large language models from OpenAI, Meta and Google. Just like the algorithms that researchers can force into giving dangerous or undesirable responses, the technique depends on machine learning.
A team of seven researchers from Robust Intelligence and Yale University said Tuesday that bypassing guardrails doesn't require specialized knowledge such as the model parameters.
Instead, would-be jailbreakers can ask LLMs to come up with convincing jailbreaks, in an iterative setup the researchers dub "Tree of Attacks with Pruning." In it, one LLM generates jailbreaking prompts, another evaluates the generated prompts, and a final model serves as the target.
"Even with the considerable time and effort spent by the likes of OpenAI, Google, and Meta, these guardrails are not resilient enough to protect enterprises and their users today," wrote Paul Kassianik, a senior research engineer at Robust Intelligence.
LLM development has accelerated as an increasing number of organizations adopt AI technology at scale, and it is outpacing security: researchers have already demonstrated multiple methods for jailbreaking LLMs, whether through specialized knowledge of the model weights or through adversarial prompts.
The jailbreak technique allowed the Robust Intelligence and Yale researchers to trick models into answering prompts they would ideally refuse, such as providing a recipe for making a homemade explosive device, describing how to use a phone to stalk and harass someone, or demonstrating how to pirate software and distribute it online.
Hackers can also use the Tree of Attacks with Pruning, or TAP, process to mount more effective cyberattacks, the report said. "Each refined approach undergoes a series of checks to ensure it aligns with the attacker's objectives, followed by evaluation against the target system. If the attack is successful, the process concludes. If not, it iterates through the generated strategies until a successful breach is achieved."
TAP can also help hackers cover their tracks better by minimizing the number of queries sent to the target model. One of the most common ways to detect an attack is to monitor the internet traffic going to a resource for multiple successive requests. The lower the number of requests, the more likely the attack is to pass under the security radar. The researchers said TAP decreases queries sent to state-of-the-art LLMs by 30% per jailbreak.
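The detection approach mentioned above, watching a resource for multiple successive requests, can be illustrated with a minimal sliding-window counter. This is only a sketch under stated assumptions, not something taken from the researchers' report: the class name `BurstDetector`, its parameters and the threshold values are hypothetical, and real monitoring systems weigh many more signals than request counts.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Hypothetical monitor: flags clients that send many requests in a short window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests      # assumed threshold, chosen by the defender
        self.window_seconds = window_seconds  # assumed length of the observation window
        self._history = defaultdict(deque)    # client id -> timestamps of recent requests

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request and return True if the client now exceeds the threshold."""
        recent = self._history[client_id]
        recent.append(timestamp)
        # Discard timestamps that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests


detector = BurstDetector(max_requests=3, window_seconds=10.0)
flags = [detector.record("client-a", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

In this sketch, a client is flagged only once its request count inside the window exceeds the threshold, which is why an attack that needs fewer queries is harder to spot with this kind of check alone.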