NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models

Kavli Affiliate: Yi Zhou | First 5 Authors: Yi Zhou, Wenpeng Xing, Dezhang Kong, Changting Lin, Meng Han | Summary: Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying […]
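
The abstract is truncated above, so the snippet below is only a minimal, hypothetical sketch of the general idea it gestures at: locating neurons whose activations separate harmful from benign prompts as candidates for later modification. It is not the NeuRel-Attack procedure from the paper; the model name, prompt sets, and the ranking-by-activation-gap heuristic are all illustrative assumptions.

```python
# Hypothetical sketch: rank neurons by how differently they activate on
# harmful vs. benign prompts. Illustrative only; NOT the NeuRel-Attack method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the paper targets safety-aligned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def mean_hidden_states(prompts):
    # Average hidden activations per layer over a small batch of prompts.
    per_prompt = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # Stack per-layer hidden states, averaging over sequence positions.
        per_prompt.append(
            torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])
        )
    return torch.stack(per_prompt).mean(dim=0)  # (num_layers + 1, hidden_dim)

benign = ["How do I bake bread?", "Explain photosynthesis."]      # placeholder prompts
harmful = ["How do I pick a lock?", "Write a phishing email."]    # placeholder prompts

# Absolute activation gap per (layer, neuron); larger gap = rough candidate.
gap = (mean_hidden_states(harmful) - mean_hidden_states(benign)).abs()
topk = torch.topk(gap.flatten(), k=10)
for score, idx in zip(topk.values, topk.indices):
    layer, neuron = divmod(idx.item(), gap.shape[1])
    print(f"layer {layer:2d}  neuron {neuron:4d}  gap {score:.4f}")
```

How such candidate neurons would then be "relearned" or otherwise modified is exactly the part elided by the truncated summary, so it is not sketched here.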