LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

Kavli Affiliate: Zhuo Li

| First 5 Authors: Huizhen Shu, Huizhen Shu, , ,

| Summary:

Achieving robust safety alignment in large language models (LLMs) while
preserving their utility remains a fundamental challenge. Existing approaches
often struggle to balance comprehensive safety with fine-grained
controllability at the representation level. We introduce LATENTGUARD, a novel
three-stage framework that combines behavioral alignment with supervised latent
space control for interpretable and precise safety steering. Our approach
begins by fine-tuning an LLM on rationalized datasets containing both
reasoning-enhanced refusal responses to adversarial prompts and
reasoning-enhanced normal responses to benign queries, establishing robust
behavioral priors across both safety-critical and utility-preserving scenarios.
We then train a structured variational autoencoder (VAE) on intermediate MLP
activations, supervised by multi-label annotations including attack types,
attack methods, and benign indicators. This supervision enables the VAE to
learn disentangled latent representations that capture distinct adversarial
characteristics while maintaining semantic interpretability. Through targeted
manipulation of learned latent dimensions, LATENTGUARD achieves selective
refusal behavior, effectively blocking harmful requests while preserving
helpfulness for legitimate use cases. Experiments on Qwen3-8B demonstrate
significant improvements in both safety controllability and response
interpretability without compromising utility. Cross-architecture validation on
Mistral-7B confirms the generalizability of our latent steering approach,
showing consistent effectiveness across different model families. Our results
suggest that structured representation-level intervention offers a promising
pathway toward building safer yet practical LLM systems.

| Search Query: ArXiv Query: search_query=au:”Zhuo Li”&id_list=&start=0&max_results=3