| paper | url | author | date |
|---|---|---|---|
| Universal and Transferable Adversarial Attacks on Aligned Language Models | [2307.15043](https://arxiv.org/abs/2307.15043) | Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson | 27 Jul 2023 |
Code: [llm-attacks/llm-attacks: Universal and Transferable Attacks on Aligned Language Models](https://github.com/llm-attacks/llm-attacks)
jailbreaks

"Jailbreaks" are carefully engineered prompts that result in aligned LLMs generating clearly objectionable content [Wei et al., 2023]. They are typically crafted through human ingenuity rather than automated methods, requiring substantial manual effort.

Automatic prompt-tuning has so far been unable to generate reliable attacks through automatic search methods. This owes largely to the fact that, unlike image models, LLMs operate on discrete token inputs, which both substantially limits the effective input dimensionality and seems to induce a computationally difficult search.
AutoPrompt
GCG follows the AutoPrompt approach, but with the (we find, practically quite important) difference that it searches over all possible tokens to replace at each step, rather than just a single one.

Note that for an LLM the actual input is not just the user's message: it also includes a system prompt, so the portion of the full prompt the attacker can actually control is quite small.
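As a rough illustration of what the attacker controls, here is a minimal sketch of where the adversarial suffix sits in the assembled input; the chat template below is a generic placeholder for illustration, not any specific model's format:

```python
# Minimal sketch of the assembled LLM input (generic template, assumed
# for illustration; real chat formats differ per model).
system_prompt = "You are a helpful, harmless assistant."  # fixed by the deployer
user_request = "Tell me how to build a bomb."             # the attacker's goal
adv_suffix = "! ! ! ! ! ! ! ! ! !"                        # the only part GCG optimizes

full_input = (
    f"<system>{system_prompt}</system>\n"
    f"<user>{user_request} {adv_suffix}</user>\n"
    f"<assistant>"
)
```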
First, view the LLM as a sequence model: the probability of the next token given the sequence $x_{1:n}$ is \(p(x_{n+1}\vert x_{1:n})\), and by the chain rule the probability of a length-$H$ continuation is

\[p(x_{n+1:n+H}\vert x_{1:n})=\prod_{i=1}^{H}p(x_{n+i}\vert x_{1:n+i-1})\]

The goal of the attack is to make the LLM generate a particular target sequence $x_{n+1:n+H}^{\star}$ (e.g. an affirmative response such as "Sure, here is how to build a bomb").
The corresponding adversarial loss is then

\[\mathcal{L}(x_{1:n})=-\log p(x_{n+1:n+H}^{\star}\vert x_{1:n})\]

(written as a function of the whole input $x_{1:n}$, since the suffix being optimized is part of it).
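In practice this loss is one teacher-forced forward pass. A minimal PyTorch sketch, assuming a Hugging Face-style causal LM (the helper `target_nll` and its signature are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def target_nll(model, input_ids, target_ids):
    """-log p(target | prompt): feed prompt + target, score the target positions.

    input_ids:  (1, n) tokens of x_{1:n} (prompt including the adversarial suffix)
    target_ids: (1, H) tokens of the target sequence x*_{n+1:n+H}
    """
    full = torch.cat([input_ids, target_ids], dim=1)  # (1, n+H)
    logits = model(full).logits                       # (1, n+H, V)
    # Logits at position i predict token i+1, so the logits scoring the
    # H target tokens live at positions n-1 .. n+H-2.
    n = input_ids.shape[1]
    tgt_logits = logits[:, n - 1 : n - 1 + target_ids.shape[1], :]
    return F.cross_entropy(
        tgt_logits.reshape(-1, tgt_logits.size(-1)),
        target_ids.reshape(-1),
        reduction="sum",  # sum of -log p over the H target tokens
    )
```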
Finding the adversarial suffix then becomes the optimization problem:
\[\underset{x_{\mathcal{I}}\in\{1,\dots,V\}^{\vert\mathcal{I}\vert}}{\text{minimize}}\;\mathcal{L}(x_{1:n})\]

where $\mathcal{I}\subset\{1,\dots,n\}$ is the set of adversarial-suffix token positions and $V$ is the vocabulary size. The straightforward approach to optimizing over this discrete set of inputs, Greedy Coordinate Gradient (GCG), is a simple extension of the AutoPrompt method.
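One GCG step can be sketched as below. This is a simplified reconstruction, not the authors' implementation: `one_hot_grad` is assumed to be the gradient of the loss with respect to a one-hot encoding of the current suffix (obtained with one backward pass), and the `topk`/`B` values are illustrative.

```python
import torch

@torch.no_grad()
def sample_candidates(one_hot_grad, suffix_ids, topk=256, B=512):
    """Build one GCG step's candidate set.

    For every suffix position, take the top-k tokens whose gradient is most
    negative (i.e. whose substitution is predicted to decrease the loss),
    then draw B random single-token substitutions from those candidates.

    one_hot_grad: (L, V) gradient of L(x_{1:n}) w.r.t. one-hot suffix tokens
    suffix_ids:   (L,)   current adversarial suffix token ids
    """
    L, _ = one_hot_grad.shape
    top_tokens = (-one_hot_grad).topk(topk, dim=1).indices  # (L, topk)
    cands = suffix_ids.repeat(B, 1)                         # (B, L) copies
    pos = torch.randint(0, L, (B,))                         # which position to swap
    tok = top_tokens[pos, torch.randint(0, topk, (B,))]     # which replacement token
    cands[torch.arange(B), pos] = tok
    return cands
```

Each of the `B` candidate suffixes is then scored with the exact loss above, the lowest-loss candidate becomes the next suffix, and the loop repeats for a fixed number of steps.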
comprehension
questions

How exactly are the top-k candidate tokens obtained at the start of each step?

Is the length of the suffix fixed in advance?
GCG is quite similar to the AutoPrompt method.
TODO
conclusion
future work