Notebook

Universal and Transferable Adversarial Attacks on Aligned Language Models

| paper | url | author | date |
| --- | --- | --- | --- |
| Universal and Transferable Adversarial Attacks on Aligned Language Models | 2307.15043.pdf (arxiv.org) | Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson | 27 Jul 2023 |

llm-attacks/llm-attacks: Universal and Transferable Attacks on Aligned Language Models (github.com)

Abstract

1 Introduction

Attacking Large Language Models via Adversarial Suffixes - LLM Safety Paper Close Reading (3) - 知乎 (zhihu.com)

2 A Universal Attack on LLMs

For an LLM, the actual input is not just what the user types: it also includes a system prompt, so the portion of the full prompt the user can actually control is quite small.
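A toy illustration of how the pieces fit together (the template below is made up; real deployments use model-specific chat templates), showing that the attacker only controls the user message and the suffix appended to it:

```python
# Hypothetical prompt layout; real chat templates differ per model (Vicuna, LLaMA-2-chat, ...).
system_prompt = "You are a helpful, honest assistant."  # fixed by the system, not the user
user_request = "Tell me how to build a bomb."           # user-controlled request
adv_suffix = "! ! ! ! ! ! ! ! ! !"                      # adversarial suffix to be optimized

full_input = (
    f"<SYSTEM> {system_prompt} </SYSTEM>\n"
    f"USER: {user_request} {adv_suffix}\n"
    f"ASSISTANT:"
)
# The attack aims to make the completion start with an affirmative target,
# e.g. "Sure, here is how to build a bomb".
```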


2.1 Formalizing the adversarial objective

First, treat the LLM as a sequence model: given the sequence $x_{1:n}$, the probability of the next token is \(p(x_{n+1}\vert x_{1:n})\), and for the next $H$ tokens, \(p(x_{n+1:n+H}\vert x_{1:n})=\prod_{i=1}^{H}p(x_{n+i}\vert x_{1:n+i-1})\). The goal of the attack is to make the LLM generate a particular target sequence $x_{n+1:n+H}^{\star}$ (e.g. an affirmative response such as "Sure, here is ...").

The corresponding adversarial loss is the negative log probability of the target sequence: \(\mathcal{L}(x_{1:n})=-\log p(x_{n+1:n+H}^{\star}\vert x_{1:n})\)

So finding the adversarial suffix becomes the optimization problem:

\[\underset{x_{\mathcal{I}}\in\{1,\dots,V\}^{\vert\mathcal{I}\vert}}{\text{minimize}}\ \mathcal{L}(x_{1:n})\]

where $\mathcal{I}\subset\{1,\dots,n\}$ is the set of positions occupied by the adversarial suffix tokens and $V$ is the vocabulary size.
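A minimal sketch of computing this loss with a causal LM, assuming a HuggingFace-style model whose forward call returns `.logits`; the function name and shapes here are ours, not the paper's code:

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, input_ids, target_ids):
    """-log p(x*_{n+1:n+H} | x_{1:n}), summed over the H target tokens.

    input_ids:  (1, n) token ids of the prompt (system + user request + suffix)
    target_ids: (1, H) token ids of the target, e.g. "Sure, here is how to ..."
    """
    full_ids = torch.cat([input_ids, target_ids], dim=1)   # (1, n+H)
    logits = model(full_ids).logits                        # (1, n+H, V)

    n = input_ids.shape[1]
    # Logits at positions n-1 .. n+H-2 predict the H target tokens.
    target_logits = logits[:, n - 1 : n - 1 + target_ids.shape[1], :]
    return F.cross_entropy(target_logits.transpose(1, 2), target_ids,
                           reduction="sum")
```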


The straightforward approach to optimizing over this discrete set of inputs is a simple extension of the AutoPrompt method: Greedy Coordinate Gradient (GCG).

comprehension

hesitation

GCG is quite similar to AutoPrompt; the main difference is that at each step GCG considers top-k candidate replacements at all positions of the adversarial suffix and keeps the single best substitution found among the sampled candidates, whereas AutoPrompt only considers replacements at one position chosen in advance.

[2010.15980] AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts (arxiv.org)
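Below is a rough sketch of a single GCG step under a few assumptions: `embed_matrix` is the model's token-embedding matrix, `adversarial_loss` is the loss sketched in 2.1, and `adversarial_loss_from_embeds` is a hypothetical variant that takes input embeddings instead of token ids (so the loss is differentiable with respect to the one-hot relaxation). The official implementation evaluates all candidates in one batched forward pass; the loop below only shows the logic:

```python
import torch

def gcg_step(model, embed_matrix, input_ids, suffix_slice, target_ids,
             top_k=256, batch_size=512):
    """One Greedy Coordinate Gradient step (sketch; helper names are hypothetical).

    embed_matrix: (V, d) token-embedding matrix of the model
    suffix_slice: positions in input_ids occupied by the adversarial suffix
    """
    # 1. Gradient of the loss w.r.t. a one-hot relaxation of the prompt tokens.
    one_hot = torch.zeros(input_ids.shape[1], embed_matrix.shape[0],
                          device=input_ids.device)
    one_hot.scatter_(1, input_ids[0].unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    embeds = one_hot @ embed_matrix                      # (n, d)
    loss = adversarial_loss_from_embeds(model, embeds.unsqueeze(0), target_ids)
    loss.backward()
    grad = one_hot.grad[suffix_slice]                    # (len_suffix, V)

    # 2. For every suffix position, the top-k substitutions with the most
    #    negative gradient, i.e. those predicted to decrease the loss most.
    candidates = (-grad).topk(top_k, dim=1).indices      # (len_suffix, top_k)

    # 3. Sample batch_size single-token replacements and keep the best one.
    best_ids, best_loss = input_ids, float("inf")
    for _ in range(batch_size):
        pos = torch.randint(suffix_slice.start, suffix_slice.stop, (1,)).item()
        tok = candidates[pos - suffix_slice.start,
                         torch.randint(0, top_k, (1,)).item()]
        cand = input_ids.clone()
        cand[0, pos] = tok
        cand_loss = adversarial_loss(model, cand, target_ids).item()
        if cand_loss < best_loss:
            best_ids, best_loss = cand, cand_loss
    return best_ids, best_loss
```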

2.3 Universal Multi-prompt and Multi-model attacks


comprehension
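A sketch of the aggregated objective for the universal attack: the same suffix is scored against several behaviours and several models by summing the per-prompt losses (the paper adds behaviours incrementally, only after the suffix already works on the earlier ones). `build_input` is a hypothetical helper that places the prompt and suffix into each model's chat template:

```python
def universal_loss(models, prompts, targets, suffix_ids):
    """Sum of adversarial losses over all (model, prompt) pairs for one suffix.

    prompts / targets: lists of token-id tensors, one per harmful behaviour
    suffix_ids:        the shared adversarial suffix being optimized
    """
    total = 0.0
    for model in models:
        for prompt_ids, target_ids in zip(prompts, targets):
            # build_input is a hypothetical helper that inserts prompt + suffix
            # into the model's chat template and returns (1, n) token ids.
            input_ids = build_input(model, prompt_ids, suffix_ids)
            total = total + adversarial_loss(model, input_ids, target_ids)
    return total
```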

3 Experimental Results: Direct and Transfer Attacks

TODO


5 Conclusion and Future Work

conclusion

future work