Jailbreaks are becoming a new ImageNet competition instead of helping us better understand LLM security. The community should revisit their choices and focus on research that can uncover new security vulnerabilities.
Jailbreak papers keep landing on arXiv and conferences. Most of them look the same and jailbreaks have turned into a new sort of ImageNet competition. This posts discusses the reasons why most of these papers are no longer valuable to the community, and how we could maximize the impact of our work to improve our understanding of LLM vulnerabilities and defenses.
Let’s start with what jailbreaks are. LLMs are fine-tuned to refuse harmful instructions
What we have | What we want | What we do |
---|---|---|
Pre-trained LLMs that have, and can use, hazardous knowledge. | Safe models that do not cause harm or help users with harmful activities. | Deploy security features that often materialize as refusal for harmful requests. |
In security, it is important to red-team protections to expose vulnerabilities and improve upon those. The first works on LLM red-teaming
Follow-up research found more ways to exploit LLMs and access hazardous knowledge. We saw methods like GCG, which optimizes text suffixes that surprisingly transfer across models. We also found ways to automate jailbreaks using other LLMs
However, the academic community has since turned jailbreaks into a new sort of ImageNet competition, focusing on achieving marginally higher attack success rates rather than improving our understanding of LLM vulnerabilities. When you start a new work, ask yourself whether you are (1) going to find a new vulnerability that helps the community understand how LLM security fails, or if you are (2) just looking for a better attack that exploits an existing vulnerability. (2) is not very interesting academically. In fact, coming back to my previous idea of understanding jailbreaks as evaluation tools, the field still uses the original GCG jailbreak to evaluate robustness, rather than its marginally improved successors.
We can learn valuable lessons from previous security research. The history of buffer overflow research is a good example: after the original “Smashing The Stack for Fun and Profit” paper
A jailbreak paper accepted in a main conference should:
Uncover a security vulnerability in a defense/model that is claimed to be robust. New research should target systems that we know have been trained not to be jailbreakable and prompts that violate the policies used to determine what prompts should be refused. Otherwise, your findings are probably not transferable. For example, if someone finds an attack that can systematically bypass the Circuit Breakers defense
Not iterate on existing vulnerabilities. We know models can be jailbroken with role-playing, do not look for a new fictional scenario. We know models can be jailbroken with encodings, do not suggest a new encoding. Examples of novel vulnerabilities we have seen lately include latent-space interventions
Another common problem is playing the wack-a-mole game with jailbreaks and patches. If a specific attack was patched, there is very little contribution in showing that a small change to the attack breaks the updated safeguards since we know that patches do not fix the underlying vulnerabilities
Explore new threat models in new production models or modalities. Models, their use cases, and their architectures keep changing. For example, we now have fusion models with multimodal inputs, and will soon have powerful agents
However, the works we keep seeing over and over again look more like “we know models Y are/were vulnerable to method X, and we show that if you use X’ you can obtain an increase of 5% on models Y”. The most common example are improvements on role-play jailbreaks. People keep finding ways to turn harmful tasks into different fictional scenarios. This is not helping us uncover new security vulnerabilities! Before starting a new project, try to think whether the outcome is going to help us uncover a previously unknown vulnerability.
Another common problem has to do with defenses. We all want to solve jailbreaks, but we need to maintain a high standard for defenses. This isn’t new, by the way. There are some great compilations of lessons learned from adversarial examples in the computer vision era.
If you work on defenses, you should take the following into account:
We should all think about the bigger problem we have at hand: we do not know how to ensure that LLMs behave the way we want. By default, researchers should avoid working on new jailbreaks unless they have a very good reason to. Answering these questions may help:
If you are interested in improving the security and safety of LLMs (these two are very different
PLACEHOLDER FOR ACADEMIC ATTRIBUTION
BibTeX citation
PLACEHOLDER FOR BIBTEX