综述:Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

paper	url	author	date
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks¹	[2310.10844] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks (arxiv.org)	Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh	Mon, 16 Oct 2023

jailbreak attack

利用 “language modeling (pretraining)”, “instruction following”, and “safety training” 三个阶段的目标不同
Mismatched Generalization: This failure mode stems from the significant gap between the complexity and diversity of the pretraining dataset and the safety training dataset. 基于Base64编码的越狱提示就是这种失败模式的一个例子