Kavli Affiliate: Jia Liu
| First 5 Authors: Changxin Tian
| Summary:
Recent advances in learning rate (LR) scheduling have demonstrated the
effectiveness of decay-free approaches that eliminate the traditional decay
phase while maintaining competitive performance. Model merging techniques have
emerged as particularly promising solutions in this domain. We present
Warmup-Stable and Merge (WSM), a general framework that establishes a formal
connection between learning rate decay and model merging. WSM provides a
unified theoretical foundation for emulating various decay strategies (including
cosine decay, linear decay, and inverse square root decay) as principled model
averaging schemes, while remaining fully compatible with diverse optimization
methods. Through extensive experiments, we identify merge duration (the training
window for checkpoint aggregation) as the most critical factor influencing model
performance, surpassing the importance of both checkpoint interval and merge
quantity. Our framework consistently outperforms the widely-adopted
Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving
significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on
MMLU-Pro. The performance advantages extend to supervised fine-tuning
scenarios, highlighting WSM’s potential for long-term model refinement.
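The abstract's core idea is that an LR decay phase can be emulated by weighted averaging of checkpoints saved during the stable phase, with different weighting schemes standing in for different decay schedules. Below is a minimal Python sketch of that idea under stated assumptions; the function names (e.g. `merge_checkpoints`) and the exact weight formulas are illustrative guesses, not the paper's released implementation.

```python
# Sketch only: emulate an LR-decay schedule by weighted averaging of
# checkpoints from the stable training phase. Weight formulas and the
# chronological ordering assumption are illustrative, not from the paper.
import math
from typing import Dict, List

import torch


def decay_weights(num_ckpts: int, schedule: str = "cosine") -> List[float]:
    """Per-checkpoint merge weights loosely mirroring a decay schedule."""
    if schedule == "cosine":
        raw = [0.5 * (1 + math.cos(math.pi * i / max(num_ckpts - 1, 1)))
               for i in range(num_ckpts)]
    elif schedule == "linear":
        raw = [1.0 - i / num_ckpts for i in range(num_ckpts)]
    elif schedule == "inv_sqrt":
        raw = [1.0 / math.sqrt(i + 1) for i in range(num_ckpts)]
    else:
        raw = [1.0] * num_ckpts  # uniform averaging
    total = sum(raw)
    return [w / total for w in raw]


def merge_checkpoints(ckpts: List[Dict[str, torch.Tensor]],
                      schedule: str = "cosine") -> Dict[str, torch.Tensor]:
    """Average parameter tensors across checkpoints (ordered oldest to newest)
    using weights derived from the chosen schedule."""
    weights = decay_weights(len(ckpts), schedule)
    merged = {k: torch.zeros_like(v, dtype=torch.float32)
              for k, v in ckpts[0].items()}
    for w, state in zip(weights, ckpts):
        for k, v in state.items():
            merged[k] += w * v.float()
    return merged
```

In this sketch, the "merge duration" highlighted in the abstract corresponds to how far back in the stable phase the list of checkpoints reaches, while the checkpoint interval and merge quantity determine how many states enter the average.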
| Search Query: ArXiv Query: search_query=au:"Jia Liu"&id_list=&start=0&max_results=3