Kavli Affiliate: Zhuo Li | First 5 Authors: Xuying Li, Zhuo Li, Yuji Kosuga, Victor Bian, | Summary: Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives like helpfulness, truthfulness, and avoidance of harm conflict. Reinforcement Learning from Human Feedback (RLHF) has achieved notable success in steering models, […]
Continue reading… Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach
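The title refers to Group Relative Policy Optimization (GRPO) applied to multiple alignment objectives such as helpfulness, safety, and truthfulness. Below is a minimal, generic sketch of how several reward signals might be combined into a single scalar and then turned into GRPO-style group-relative advantages. This is not the paper's implementation; the objective names, weights, and helper functions are illustrative assumptions only.

```python
# Generic illustration of multi-objective reward aggregation with a
# GRPO-style group-relative advantage. Not the paper's method; all
# names and weights below are hypothetical.
from typing import Dict, List


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective scores (e.g. helpfulness, safety, truthfulness)."""
    return sum(weights[k] * scores[k] for k in weights)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO normalizes each completion's reward against its sampled group:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    eps = 1e-8
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical per-completion scores from separate reward models
    # for one prompt's group of sampled responses.
    group_scores = [
        {"helpfulness": 0.9, "safety": 0.20, "truthfulness": 0.7},
        {"helpfulness": 0.6, "safety": 0.90, "truthfulness": 0.8},
        {"helpfulness": 0.4, "safety": 0.95, "truthfulness": 0.6},
    ]
    weights = {"helpfulness": 0.4, "safety": 0.4, "truthfulness": 0.2}  # assumed trade-off
    rewards = [combined_reward(s, weights) for s in group_scores]
    print(group_relative_advantages(rewards))
```

The weighted-sum aggregation above is only one way to trade off conflicting objectives; the paper's actual multi-objective formulation may differ.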