Kavli Affiliate: Ke Wang
| First 5 Authors: Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma
| Summary:
Large Language Models (LLMs) have shown impressive reasoning capabilities in
well-defined problems with clear solutions, such as mathematics and coding.
However, they still struggle with complex real-world scenarios such as business
negotiations, which require strategic reasoning: the ability to navigate dynamic
environments and align actions with long-term goals amid uncertainty. Existing
methods for strategic reasoning face challenges in adaptability, scalability, and
transferring strategies to new contexts. To address these issues, we propose
explicit policy optimization (EPO) for strategic reasoning, featuring an LLM
that proposes strategies in an open-ended action space and can be plugged into
arbitrary LLM agents to motivate goal-directed behavior. To improve
adaptability and policy transferability, we train the strategic reasoning model
via multi-turn reinforcement learning (RL) with process rewards and iterative
self-play, without supervised fine-tuning (SFT) as a preliminary step (this
setup is sketched after the summary).
Experiments across social and physical domains demonstrate EPO’s ability to
achieve long-term goal alignment through enhanced strategic reasoning, reaching
state-of-the-art performance on social dialogue and web navigation tasks. Our
findings reveal collaborative reasoning mechanisms that emerge in EPO and
demonstrate its effectiveness in generating novel strategies, underscoring its
potential for strategic reasoning in real-world applications.
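Below is a minimal sketch of the plug-in setup the abstract describes: a separate
strategic-reasoner LLM that emits a strategy each turn, a frozen agent LLM whose
prompt the strategy is injected into, and a per-turn (process) reward collected for
a multi-turn RL update. All names here (call_llm, StrategicReasoner, process_reward,
the model identifiers) are hypothetical placeholders, not the paper's implementation.

    from dataclasses import dataclass, field

    def call_llm(model: str, prompt: str) -> str:
        """Stub for any chat-completion backend (hypothetical)."""
        return f"[{model} response to: {prompt[:40]}...]"

    @dataclass
    class StrategicReasoner:
        """A separate LLM that emits a strategy each turn, conditioned on the
        goal and the interaction so far; the acting agent LLM stays untouched."""
        model: str
        history: list = field(default_factory=list)

        def propose_strategy(self, goal: str) -> str:
            prompt = (
                f"Goal: {goal}\n"
                f"Interaction so far: {self.history}\n"
                "Propose one high-level strategy for the next turn."
            )
            strategy = call_llm(self.model, prompt)
            self.history.append(strategy)
            return strategy

    def agent_act(agent_model: str, observation: str, strategy: str) -> str:
        """Any frozen LLM agent: the strategy is simply injected into its prompt."""
        return call_llm(agent_model, f"Strategy hint: {strategy}\nObservation: {observation}")

    def process_reward(observation: str, action: str, goal: str) -> float:
        """Hypothetical per-turn (process) reward judging progress toward the
        goal, e.g. from an LLM judge or environment signal; a constant stub here."""
        return 0.0

    def run_episode(env_steps, goal, reasoner, agent_model="agent-llm"):
        """One multi-turn episode: collects a (strategy, action, reward) trace
        that a multi-turn RL algorithm, e.g. a policy-gradient update, could
        train the reasoner on."""
        trace = []
        for observation in env_steps:
            strategy = reasoner.propose_strategy(goal)
            action = agent_act(agent_model, observation, strategy)
            trace.append((strategy, action, process_reward(observation, action, goal)))
        return trace

    if __name__ == "__main__":
        reasoner = StrategicReasoner(model="strategic-reasoner-llm")
        for step in run_episode(["opening offer", "counteroffer"], "close the deal", reasoner):
            print(step)

Decoupling the strategy proposal from the acting agent is what would let a trained
reasoner transfer: the same reasoner can, in principle, steer different agent LLMs
across social and physical environments.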
| Search Query: ArXiv Query: search_query=au:"Ke Wang"&id_list=&start=0&max_results=3