综述:Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
paper | url | author | date |
---|---|---|---|
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks1 | [2310.10844] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks (arxiv.org) | Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh | Mon, 16 Oct 2023 |
jailbreak attack
利用 “language modeling (pretraining)”, “instruction following”, and “safety training” 三个阶段的目标不同
Mismatched Generalization: This failure mode stems from the significant gap between the complexity and diversity of the pretraining dataset and the safety training dataset. 基于Base64编码的越狱提示就是这种失败模式的一个例子
参考
All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.