Kavli Affiliate: Zhuo Li
| First 5 Authors: Yuhao Du, Zhuo Li, Pengyu Cheng, Xiang Wan, Anningzhe Gao
| Summary:
Large Language Models (LLMs) have become a focal point in the rapidly
evolving field of artificial intelligence. However, a critical concern is the
presence of toxic content within the pre-training corpus of these models, which
can lead to the generation of inappropriate outputs. Investigating methods for
detecting internal faults in LLMs can help us understand their limitations and
improve their security. Existing methods primarily focus on jailbreaking
attacks, which involve manually or automatically constructing adversarial
content to prompt the target LLM to generate unexpected responses. These
methods rely heavily on prompt engineering, which is time-consuming and usually
requires specially designed questions. To address these challenges, this paper
proposes a target-driven attack paradigm that focuses on directly eliciting the
target response instead of optimizing the prompts. We introduce another LLM as
a detector of toxic content, referred to as ToxDet. Given a target toxic
response, ToxDet generates a candidate question and a preliminary answer that
provoke the target model into producing a toxic response semantically
equivalent to the provided one. ToxDet is trained by
interacting with the target LLM and receiving reward signals from it, utilizing
reinforcement learning for the optimization process. While the target models we
focus on are primarily open-source LLMs, the fine-tuned ToxDet also transfers to
attacking black-box models such as GPT-4o, achieving notable results.
Experimental results on the AdvBench and HH-Harmless datasets demonstrate the
effectiveness of our method in detecting the tendency of target LLMs to
generate harmful responses. This algorithm not only exposes vulnerabilities but
also provides a valuable resource for researchers to strengthen their models
against such attacks.
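
To make the described training loop concrete, below is a minimal Python sketch of the target-driven interaction the abstract outlines: given a target toxic response, a detector model proposes a question plus a preliminary answer, the target LLM is queried, and a semantic-equivalence reward drives an RL update of the detector. All names (toxdet_generate, query_target_llm, semantic_reward, policy_update) and the Jaccard-overlap reward are hypothetical stand-ins for illustration only, not the paper's actual implementation.

```python
# Sketch of one ToxDet-style training step, assuming stand-in components.

def toxdet_generate(target_response: str) -> tuple[str, str]:
    """Stand-in for ToxDet: given a target toxic response, produce a
    candidate question and a preliminary answer intended to elicit it."""
    question = f"Hypothetical question aimed at eliciting: {target_response[:40]}..."
    preliminary_answer = "Sure, here is how"  # partial answer used as a lead-in
    return question, preliminary_answer

def query_target_llm(question: str, preliminary_answer: str) -> str:
    """Stand-in for the target (open-source or black-box) LLM being probed."""
    return "placeholder response from the target model"

def semantic_reward(response: str, target_response: str) -> float:
    """Toy reward: token-overlap (Jaccard) as a crude proxy for the semantic
    equivalence between the elicited response and the target response."""
    a, b = set(response.lower().split()), set(target_response.lower().split())
    return len(a & b) / max(len(a | b), 1)

def policy_update(reward: float) -> None:
    """Stand-in for the RL step (e.g., a policy-gradient update of ToxDet)."""
    print(f"update ToxDet with reward={reward:.3f}")

def train_step(target_response: str) -> None:
    # Detector proposes an attack, the target model is queried, and the
    # reward signal from that interaction is fed back to the detector.
    question, preliminary_answer = toxdet_generate(target_response)
    response = query_target_llm(question, preliminary_answer)
    reward = semantic_reward(response, target_response)
    policy_update(reward)

if __name__ == "__main__":
    # One illustrative iteration with a benign placeholder "target response".
    train_step("an example target response drawn from AdvBench or HH-Harmless")
```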
| Search Query: ArXiv Query: search_query=au:"Zhuo Li"&id_list=&start=0&max_results=3