paper url author date
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks1 [2310.10844] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks (arxiv.org) Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh Mon, 16 Oct 2023

jailbreak attack

  • 利用 “language modeling (pretraining)”, “instruction following”, and “safety training” 三个阶段的目标不同

    20240205132209
  • Mismatched Generalization: This failure mode stems from the significant gap between the complexity and diversity of the pretraining dataset and the safety training dataset. 基于Base64编码的越狱提示就是这种失败模式的一个例子

    20240205133434

参考


  1. 对抗性攻击揭示的大语言模型脆弱性:综述 - 知乎 (zhihu.com)↩︎