RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training

Kavli Affiliate: Wei Gao

| First 5 Authors: Wei Gao, Wei Gao

| Summary:

Reinforcement Learning (RL) is a pivotal post-training technique for
enhancing the reasoning capabilities of Large Language Models (LLMs). However,
synchronous RL post-training often suffers from significant GPU
underutilization, referred to as bubbles, caused by imbalanced response lengths
within rollout steps. Many RL systems attempt to alleviate this problem by
relaxing synchronization, but this can compromise training accuracy. In this
paper, we introduce tail batching, a novel rollout scheduling strategy for
synchronous RL that systematically consolidates prompts leading to long-tail
responses into a small subset of rollout steps (long rounds), while ensuring
that the majority of steps (short rounds) involve only balanced, short
rollouts. By excluding long responses from short rounds and rescheduling them
into a few designated long rounds, tail batching effectively reduces GPU idle
time during rollouts and significantly accelerates RL training without
sacrificing accuracy. We present RollPacker, a system that fully harnesses the
benefits of tail batching through holistic optimizations across all three RL
stages: elastic parallelism adaptation for rollout, dynamic resource allocation
and scheduling for reward, and stream-based training. Empirical results show
that RollPacker achieves a 2.03x-2.56x end-to-end training time reduction
compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5
family of LLMs on up to 128 H800 GPUs.
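
The scheduling idea behind tail batching can be illustrated with a short sketch: prompts expected to produce long-tail responses are deferred into a few dedicated long rounds, while the remaining prompts are grouped into length-balanced short rounds. This is only a minimal illustration of the strategy as described in the abstract; the function names, the length predictor, and the threshold below are hypothetical and not taken from the RollPacker implementation.

```python
# Hypothetical sketch of tail batching (names and threshold are illustrative).
from typing import Callable, List, Tuple

def tail_batch(
    prompts: List[str],
    predict_len: Callable[[str], int],   # assumed response-length predictor
    batch_size: int,
    tail_threshold: int = 2048,          # illustrative cutoff for "long-tail" responses
) -> Tuple[List[List[str]], List[List[str]]]:
    """Split prompts into balanced short rounds and a few consolidated long rounds."""
    short = [p for p in prompts if predict_len(p) <= tail_threshold]
    long = [p for p in prompts if predict_len(p) > tail_threshold]

    # Short rounds: sort by predicted length so each rollout step contains
    # similarly sized responses, reducing straggler-induced GPU bubbles.
    short.sort(key=predict_len)
    short_rounds = [short[i:i + batch_size] for i in range(0, len(short), batch_size)]

    # Long rounds: long-tail prompts are packed together, so the idle time
    # they cause is confined to a small number of rollout steps.
    long_rounds = [long[i:i + batch_size] for i in range(0, len(long), batch_size)]
    return short_rounds, long_rounds
```

Under this reading, most rollout steps run only balanced short batches, and the long-tail work is amortized across a few designated rounds rather than stalling every step.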

| Search Query: ArXiv Query: search_query=au:"Wei Gao"&id_list=&start=0&max_results=3
