Kavli Affiliate: Yi Zhou
| First 5 Authors: Yi Zhou, Wenpeng Xing, Dezhang Kong, Changting Lin, Meng Han
| Summary:
Safety alignment in large language models (LLMs) is achieved through
fine-tuning mechanisms that regulate neuron activations to suppress harmful
content. In this work, we propose a novel approach to induce disalignment by
identifying and modifying the neurons responsible for safety constraints. Our
method consists of three key steps: Neuron Activation Analysis, where we
examine activation patterns on harmful and harmless prompts to detect the
neurons most critical for distinguishing between the two classes of input;
Similarity-Based Neuron Identification, which systematically locates the
neurons responsible for safety alignment; and Neuron Relearning for Safety
Removal, where we fine-tune the selected neurons to restore the model's
ability to generate previously restricted responses. Experimental
results demonstrate that our method effectively removes safety constraints with
minimal fine-tuning, highlighting a critical vulnerability in current alignment
techniques. Our findings underscore the need for robust defenses against
adversarial fine-tuning attacks on LLMs.
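
As a rough illustration of the identification step only (not the authors' implementation; the function names, activation shapes, and scoring rule below are assumptions), the sketch scores each neuron by how strongly its mean activation shifts between harmful and harmless prompt sets and keeps the top-k candidates for a later relearning stage.

```python
# Hypothetical sketch: rank neurons by how much their mean activation differs
# between harmful and harmless prompts, as a stand-in for the paper's
# similarity-based identification step. Shapes and scoring are assumed.
import numpy as np

def identify_safety_neurons(acts_harmful, acts_harmless, top_k=64):
    """acts_*: arrays of shape (num_prompts, num_neurons) holding hidden-layer
    activations collected from the harmful and harmless prompt sets."""
    mean_harmful = acts_harmful.mean(axis=0)    # per-neuron mean on harmful prompts
    mean_harmless = acts_harmless.mean(axis=0)  # per-neuron mean on harmless prompts

    # Neurons whose activations shift most between the two prompt classes are
    # treated as candidates for encoding the safety constraint.
    diff = np.abs(mean_harmful - mean_harmless)
    scores = diff / (np.abs(mean_harmful) + np.abs(mean_harmless) + 1e-8)
    return np.argsort(scores)[::-1][:top_k]

# Toy usage with random activations standing in for real model traces.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 1024))
harmless = rng.normal(size=(32, 1024))
print(identify_safety_neurons(harmful, harmless, top_k=8))
```

In the relearning step described in the abstract, gradient updates during fine-tuning would then be restricted to the parameters feeding these selected neurons, which is what keeps the attack low-cost relative to full-model fine-tuning.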
| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3