ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism

Kavli Affiliate: Jia Liu

| First 5 Authors: Jia Liu, Jia Liu

| Summary:

Recent advancements in Large Language Models have yielded significant
improvements in complex reasoning tasks such as mathematics and programming.
However, these models remain heavily dependent on annotated data and exhibit
limited adaptability in unsupervised scenarios. To address these limitations,
test-time reinforcement learning (TTRL) has been proposed, which enables
self-optimization by leveraging model-generated pseudo-labels. Despite its
promise, TTRL faces several key challenges, including high inference costs due
to parallel rollouts and early-stage estimation bias that fosters
overconfidence, reducing output diversity and causing performance plateaus. To
address these challenges, we introduce an entropy-based mechanism to enhance
the exploration-exploitation balance in test-time reinforcement learning
through two strategies: Entropy-fork Tree Majority Rollout (ETMR) and
Entropy-based Advantage Reshaping (EAR). Compared with the baseline, our
approach enables Llama3.1-8B to achieve a 68% relative improvement in the
Pass@1 metric on the AIME 2024 benchmark, while consuming only 60% of the
rollout token budget. This highlights our method's ability to effectively
optimize the trade-off between inference efficiency, diversity, and estimation
robustness, thereby advancing unsupervised reinforcement learning for
open-domain reasoning tasks.
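
The abstract only names the two mechanisms, so the sketch below is purely illustrative: it shows how a TTRL-style pseudo-label can be derived by majority voting over parallel rollouts, and how token-level advantages might be reshaped by policy entropy in the spirit of EAR. The function names, the `alpha` coefficient, and the exact reshaping formula are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of (1) majority-vote pseudo-labels from parallel rollouts
# and (2) an entropy-based reshaping of token-level advantages (EAR-like).
# Names, `alpha`, and the formula are illustrative assumptions.
from collections import Counter

import torch


def majority_vote_pseudo_label(answers: list[str]) -> str:
    """Pick the most frequent final answer among parallel rollouts as the pseudo-label."""
    return Counter(answers).most_common(1)[0][0]


def entropy_reshaped_advantages(advantages: torch.Tensor,
                                logits: torch.Tensor,
                                alpha: float = 0.5) -> torch.Tensor:
    """Up-weight advantages on high-entropy tokens to preserve exploration.

    advantages: (seq_len,) per-token advantage estimates.
    logits:     (seq_len, vocab) policy logits for the same tokens.
    """
    probs = torch.softmax(logits, dim=-1)
    token_entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)
    # Normalize entropy to [0, 1] by the maximum possible entropy log(vocab).
    token_entropy = token_entropy / torch.log(torch.tensor(float(logits.shape[-1])))
    # Blend: uncertain (high-entropy) tokens get a larger share of the update,
    # which counteracts the overconfidence / diversity collapse described above.
    return advantages * (1.0 + alpha * token_entropy)


# Example: reward each rollout by agreement with the majority-vote pseudo-label.
answers = ["42", "41", "42", "42"]                   # final answers of 4 rollouts
pseudo_label = majority_vote_pseudo_label(answers)   # -> "42"
rewards = torch.tensor([float(a == pseudo_label) for a in answers])
```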

| Search Query: ArXiv Query: search_query=au:"Jia Liu"&id_list=&start=0&max_results=3

Read More