Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Kavli Affiliate: Zhuo Li

| First 5 Authors: Huizhen Shu, Huizhen Shu, , ,

| Summary:

With the rapid proliferation of Natural Language Processing (NLP), especially
Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs
remains a key challenge for understanding model vulnerabilities and improving
robustness. In this context, we propose a new black-box attack method that
leverages the interpretability of large models. We introduce the Sparse Feature
Perturbation Framework (SFPF), a novel approach for adversarial text generation
that utilizes sparse autoencoders to identify and manipulate critical features
in text. After using the SAE model to reconstruct hidden layer representations,
we perform feature clustering on the successfully attacked texts to identify
features with higher activations. These highly activated features are then
perturbed to generate new adversarial texts. This selective perturbation
preserves the malicious intent while amplifying safety signals, thereby
increasing their potential to evade existing defenses. Our method enables a new
red-teaming strategy that balances adversarial effectiveness with safety
alignment. Experimental results demonstrate that adversarial texts generated by
SFPF can bypass state-of-the-art defense mechanisms, revealing persistent
vulnerabilities in current NLP systems.However, the method’s effectiveness
varies across prompts and layers, and its generalizability to other
architectures and larger models remains to be validated.

| Search Query: ArXiv Query: search_query=au:”Zhuo Li”&id_list=&start=0&max_results=3