Kavli Affiliate: Zhuo Li | First Authors: Huizhen Shu | Summary: Achieving robust safety alignment in large language models (LLMs) while preserving their utility remains a fundamental challenge. Existing approaches often struggle to balance comprehensive safety with fine-grained controllability at the representation level. We introduce LATENTGUARD, a novel three-stage framework […]
Continue reading: LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation
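The excerpt names the core technique, "controllable latent steering," but does not describe LatentGuard's three stages. Below is a minimal, generic sketch of what latent (activation) steering usually looks like: a direction vector is added to a chosen hidden layer's activations at inference time, with a scalar coefficient trading off refusal strength against utility. The model, layer choice, and the randomly drawn steering direction are all illustrative assumptions, not the paper's actual pipeline.

```python
# Generic activation-steering sketch (hypothetical; not the LatentGuard
# three-stage method, which is not detailed in this excerpt).
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Stand-in for a transformer block stack, used only for illustration."""

    def __init__(self, d_model: int = 64, vocab: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.block1 = nn.Linear(d_model, d_model)
        self.block2 = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(ids)
        h = torch.relu(self.block1(h))
        h = torch.relu(self.block2(h))
        return self.head(h)


def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float):
    """Register a forward hook that shifts the layer's output along `direction`.

    alpha > 0 pushes activations toward the (assumed) refusal direction;
    alpha = 0 leaves the model's behavior unchanged.
    """
    direction = direction / direction.norm()

    def hook(_module, _inputs, output):
        return output + alpha * direction

    return layer.register_forward_hook(hook)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyLM()
    ids = torch.randint(0, 100, (1, 8))

    # In practice a steering direction would be estimated from contrastive
    # activations (e.g. harmful vs. benign prompts); here it is random.
    refusal_dir = torch.randn(64)

    handle = add_steering_hook(model.block2, refusal_dir, alpha=4.0)
    steered_logits = model(ids)
    handle.remove()  # detach the hook to restore default behavior
    baseline_logits = model(ids)

    print((steered_logits - baseline_logits).abs().max())
```

The hook-based design keeps the steering intervention separate from the model weights, so the strength coefficient can be adjusted, or the intervention removed entirely, without retraining.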