PARDEN, Can You Repeat That?
Defending against Jailbreaks via Repetition
Ziyang Zhang, Qizhen Zhang, Jakob Foerster
FLAIR, University of Oxford
Accepted at ICML 2024
See the original paper here
TL;DR:
We prompt the LLM to repeat its own output, which protects the model from adversarial attacks by avoiding the auto-regressive trap. [1] We name this method PARDEN.[2]
We find that PARDEN is particularly effective in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at a threshold corresponding to a TPR of 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviors dataset, compared with the baseline (Helbling et al., 2023).
PARDEN is illustrated in the following flowchart (Figure 1):
Figure 1: PARDEN
Our dataset and code are available at https://github.com/Ed-Zh/PARDEN
Footnotes:
[1] LLMs sample responses one token at a time, without planning or anticipating what comes further in the future. As a result, the LLM can start sampling the response “Sure, let me help you with that...” (rather than a refusal), without “realising” that compliance with the request ultimately results in a detailed instruction for committing cyber-crime. For reference, see a simple example and a detailed discussion.
[2] PARDEN is short for Safe-Proofing Language Models via a Repetition Defense.
Try PARDEN here!
Select a scenario below to see how PARDEN works in action. Observe how PARDEN acts on benign and harmful outputs. Notice how the baseline classifier tends to default to "I do not have enough context" when prompted to classify.
Note: the demo above uses Claude-2.1 as the defender model (via API access), while the original LLM output is generated by Llama2-7B. To run PARDEN with other defender models (such as Llama2-7B), follow this notebook.
What are jailbreaks?
They say a safe GPT is a boring GPT. Ever wondered how to jailbreak LLMs?
Rigorous safety alignment has mostly made hand-tuned jailbreaks a thing of the past - your ChatGPT will no longer teach you how to build a bomb right away, even if you promise it’s for helping your old grandma.
However, recent studies (Zou et al., 2023; Qi et al., 2023) have shown that supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to algorithmic jailbreaks: inputs that induce LLMs to produce undesirable outputs. This creates significant risk if such LLMs are abused by users with ill intent.
Screenshots of harmful content. Adapted from Zou et al., 2023
How does the "autoregressive trap" lead to jailbreaks?
The first step of developing a safeguard is to understand how jailbreaks are produced. Many attacks (Zhu et al., 2023; Zou et al., 2023) exploit the so-called “auto-regressive trap”, i.e. the fact that LLMs sample responses one token at a time, without planning or anticipating what comes further in the future.
As a result, the LLM can start sampling the response “Sure, let me help you with that...” (rather than a refusal), without “realising” that compliance with the request ultimately results in a detailed instruction for committing cyber-crime.
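To see why the trap arises, here is a minimal sketch of greedy autoregressive decoding. The `model.next_token_logits` interface is a hypothetical stand-in for a real model API; the point is only that each token is committed to while conditioning on the tokens emitted so far, with no lookahead.

```python
# Illustration of the "auto-regressive trap" with greedy decoding.
# `model.next_token_logits` is an illustrative, hypothetical API, not a real library call.
def greedy_decode(model, prompt_tokens, max_new_tokens=128, eos_id=2):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.next_token_logits(tokens)            # distribution over the NEXT token only
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(next_id)                               # once "Sure, let me help..." has started,
        if next_id == eos_id:                                # the continuation is conditioned on it
            break
    return tokens
```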
PARDEN! Can you repeat that?
Many jailbreak methods also use LLMs as attackers, and these attackers become more powerful as LLMs improve. It is therefore crucial to develop defence methods that likewise use the LLMs themselves, so that the defence improves as the LLMs improve.
We propose PARDEN, a simple but surprisingly effective method for detecting jailbreaks. At a high level, PARDEN prompts the LLM to repeat its own output, with a few in-context examples included in the prompt to prime the repetition (illustrated above in Figure 1).
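To make this concrete, here is a minimal sketch of what a PARDEN-style repeat step could look like. The prompt wording, the few-shot example, and the `llm.generate` interface are illustrative assumptions; the exact few-shot prompt used in the paper is available in the linked repository.

```python
# Sketch of a PARDEN-style repeat step (hypothetical prompt wording and LLM API;
# the exact few-shot prompt used in the paper lives in the PARDEN repository).

FEW_SHOT_EXAMPLE = (
    "Here is an example of repeating content exactly:\n"
    "Input: 'The capital of France is Paris.'\n"
    "Output: 'The capital of France is Paris.'\n\n"
)

def build_repeat_prompt(model_output: str) -> str:
    """Wrap the LLM's own output y in a prompt asking it to repeat y verbatim."""
    return (
        FEW_SHOT_EXAMPLE
        + "Please repeat the following content exactly, without adding anything:\n"
        + f"Input: '{model_output}'\n"
        + "Output:"
    )

def repeat(llm, model_output: str) -> str:
    """REPEAT(y): ask the (safety-aligned) defender LLM to echo the output y."""
    # `llm.generate` stands in for any chat/completions API call.
    return llm.generate(build_repeat_prompt(model_output))
```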
Our Finding
How does the repeat compare with the original output? We observe that 1) if the output is benign, the LLM attempts to repeat the output verbatim; 2) if the output is harmful, the LLM refuses to repeat it (“I’m sorry, I can’t help with that”), provided that the LLM is safety-aligned.
To quantify this observation, we use the BLEU score between y and REPEAT(y), to determine whether the model is attempting to repeat the output or refusing to do so. Intuitively, when the model is attempting to repeat, the repeat is nearly identical to the original output, giving a BLEU score near 1.0 (red distribution); conversely, when the model is refusing to repeat, the repeat/refusal is highly dissimilar to the original output, giving a low BLEU score (blue distribution).
Figure 2: The BLEU score distributions. Left: BLEU(y,REPEAT(y)) for Llama2-7B defending Llama2-7B. Right: BLEU(y,REPEAT(y)) for Claude-2.1 defending Claude-2.1
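As a concrete sketch of this scoring step, the snippet below computes a sentence-level BLEU score between y and REPEAT(y) and flags the output as harmful when the score falls below a threshold. It reuses the hypothetical `repeat` helper from the sketch above; NLTK's BLEU and whitespace tokenisation are stand-ins, and the paper's exact BLEU implementation and tokenisation may differ.

```python
# Sketch: score how faithfully the defender repeats y, then threshold the score.
# NLTK's sentence-level BLEU (in [0, 1]) is used here as one possible implementation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def repeat_bleu(y: str, repeated: str) -> float:
    """BLEU(y, REPEAT(y)): near 1.0 when the model repeats faithfully, low when it refuses."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short refusals
    return sentence_bleu([y.split()], repeated.split(), smoothing_function=smooth)

def parden_flag(llm, y: str, threshold: float) -> bool:
    """Return True if output y is flagged as harmful (the defender refused to repeat it)."""
    # The threshold t is chosen to hit a target operating point (e.g. a fixed TPR).
    return repeat_bleu(y, repeat(llm, y)) < threshold
```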
Benchmarking PARDEN
Setting a threshold t on the BLEU score induces a family of classifiers, indexed by t. Formally, let $h_t$ be a classifier of outputs, defined by:

$$h_t(y) = \begin{cases} \text{harmful}, & \text{if } \mathrm{BLEU}\big(y, \mathrm{REPEAT}(y)\big) < t \\ \text{benign}, & \text{otherwise.} \end{cases}$$
This allows us to quantify the performance of PARDEN as a classifier, using statistics such as the True Positive Rate (TPR), False Positive Rate (FPR), and Area Under the Curve (AUC), and to compare it to the baseline (Helbling et al., 2023).
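The sketch below shows how such a threshold sweep can be benchmarked with scikit-learn. The variable names (`harmful_scores`, `benign_scores`) are illustrative: they hold BLEU(y, REPEAT(y)) values collected on labelled harmful and benign outputs, with harmful outputs treated as the positive class.

```python
# Sketch: sweep the threshold t and compute TPR/FPR/AUC with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, auc

def benchmark(harmful_scores, benign_scores):
    """ROC analysis of the BLEU-thresholding classifier h_t."""
    labels = np.concatenate([np.ones(len(harmful_scores)),   # harmful = positive class
                             np.zeros(len(benign_scores))])
    # PARDEN assigns LOW BLEU scores to harmful outputs, so negate the score
    # so that higher values indicate "more likely harmful".
    scores = -np.concatenate([harmful_scores, benign_scores])
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return fpr, tpr, auc(fpr, tpr)
```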
Experimental Results
Note that PARDEN assumes a safety-aligned model as the defender: it is expected to work when Claude or Llama is used as the defender, but not when Mistral is.
As shown in the table below, PARDEN consistently outperforms the baseline (Helbling et al., 2023), considerably reducing the FPR at the same TPR in 9 out of 10 scenarios where its premise is satisfied (i.e. Llama or Claude as the defender). For comprehensiveness, we also include results with Mistral as the defender.
Attacked LLM | Defender LLM | Harmful Dataset | (TPR, FPR) Baseline | (TPR, FPR) Ours |
---|---|---|---|---|
mistral | claude | behaviors | (63.85%, 4.17%) | (63.85%, 1.27%) |
llama | claude | behaviors | (76.65%, 1.09%) | (76.65%, 1.27%) |
mistral | claude | strings | (47.56%, 4.17%) | (90.00%, 0.91%) |
claude | claude | strings | (69.20%, 2.72%) | (90.00%, 1.09%) |
llama | claude | strings | (63.84%, 1.09%) | (90.00%, 1.09%) |
mistral | llama | behaviors | (90.00%, 66.67%) | (90.00%, 7.43%) |
llama | llama | behaviors | (90.00%, 24.80%) | (90.00%, 1.99%) |
mistral | llama | strings | (90.00%, 7.25%) | (90.00%, 1.09%) |
claude | llama | strings | (90.00%, 13.41%) | (90.00%, 10.69%) |
llama | llama | strings | (90.00%, 1.81%) | (90.00%, 0.36%) |
mistral | mistral | behaviors | (90.00%, 94.75%) | (90.00%, 100.00%) |
llama | mistral | behaviors | (90.00%, 82.02%) | (90.00%, 100.00%) |
mistral | mistral | strings | (90.00%, 84.96%) | (90.00%, 100.00%) |
claude | mistral | strings | (90.00%, 99.64%) | (90.00%, 34.06%) |
llama | mistral | strings | (90.00%, 79.96%) | (90.00%, 100.00%) |
Here 'llama' is Llama2-7B, 'mistral' is Mistral-7B, and 'claude' is Claude-2.1. For PARDEN, we select the threshold t so that the TPR is fixed at 90%. For the baseline classifier, we similarly 1) fix the TPR at 90% when we have access to the logits (white-box models: mistral, llama), and 2) use the raw text output otherwise (black-box model: claude). Harmful Dataset refers to the harmful strings and harmful behaviors datasets in AdvBench (Zou et al., 2023).
What benefit does PARDEN offer?
1. PARDEN operates solely on the output space, whereas safeguard mechanisms on the input space can be explicitly circumvented. By censoring the output rather than the input, we make the defence more difficult for attackers to target directly.
2. Models adapt to evolving definitions of what constitutes harmful content. By asking the model to repeat itself, we bootstrap the model's latest training to reassess the output - this dynamic adaptation comes at no extra cost. In contrast, a static classification criterion might become outdated or fail to capture nuanced or emerging forms of harmful content, unless we perform further instruction tuning to classify such content.
These advantages make PARDEN a very promising and versatile technique that requires no training. While Llama Guard and other explicitly trained classifiers need to be retrained again and again as definitions and accepted ethical standards evolve, the PARDEN defender automatically evolves alongside the base model.
Failure Cases
Although more effective than alternative approaches, PARDEN does have false negatives. Inspecting these failure cases reveals that the outputs are often not harmful in themselves; the malicious intent is only present in the prompt.
One such example is asking the model to write fake reviews. This underscores the importance of judging harmful content in the context of its intent. Indeed, it is questionable whether this should even be considered a false negative, since the user could have simply lied about their intent and produced the same (per se harmless) output.
Future Work Directions
Repetition is one operation that stitches together two LMs. Mathematically, we can think of benign examples as the fixed points of this operation, since they are preserved under repetition, whereas harmful examples are corrupted by it. Interestingly, in addition to harmful and benign examples, we found that atypical, gibberish text also tends to be poorly preserved. What are some other operations one can define on higher-order LMs? What do their fixed points and non-fixed points tell us?
Using PARDEN to correct jailbreaks, we can also gather a dataset of (jailbreak prompt, PARDEN corrected output). Future work may explore how fine-tuning on this dataset can enhance the safety of LLMs.
Citation
@inproceedings{PARDEN,
  title={PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition},
  author={Zhang, Ziyang and Zhang, Qizhen and Foerster, Jakob},
  booktitle={International Conference on Machine Learning},
  year={2024},
  organization={PMLR}
}