Kavli Affiliate: Jia Liu | First 5 Authors: Changxin Tian | Summary: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert […]
Continue reading: Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
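To make the decoupling described in the summary concrete, here is a minimal Python sketch that counts total versus activated expert parameters for a hypothetical MoE configuration. All names and sizes below are illustrative assumptions, not values or methods taken from the paper.

```python
# Hypothetical illustration of the MoE parameter/compute decoupling.
# All configuration values and names are assumptions, not from the paper.

from dataclasses import dataclass


@dataclass
class MoEConfig:
    d_model: int      # hidden size
    d_ff: int         # expert feed-forward inner size
    num_experts: int  # total experts per MoE layer
    top_k: int        # experts activated per token
    num_layers: int   # number of MoE layers


def expert_ffn_params(d_model: int, d_ff: int) -> int:
    # One expert modeled as a two-matrix FFN (up- and down-projection);
    # biases and router parameters are ignored for simplicity.
    return 2 * d_model * d_ff


def total_expert_params(cfg: MoEConfig) -> int:
    # Parameters stored across all experts in all MoE layers.
    return cfg.num_layers * cfg.num_experts * expert_ffn_params(cfg.d_model, cfg.d_ff)


def active_expert_params(cfg: MoEConfig) -> int:
    # Parameters actually exercised per token: only top_k experts fire per
    # layer, so per-token compute scales with top_k, not num_experts.
    return cfg.num_layers * cfg.top_k * expert_ffn_params(cfg.d_model, cfg.d_ff)


if __name__ == "__main__":
    cfg = MoEConfig(d_model=4096, d_ff=14336, num_experts=64, top_k=2, num_layers=32)
    print(f"total expert params:  {total_expert_params(cfg) / 1e9:.1f}B")
    print(f"active expert params: {active_expert_params(cfg) / 1e9:.1f}B")
```

Growing `num_experts` raises total (stored) capacity without changing the activated parameter count per token, which is exactly why predicting model capacity from an MoE configuration is nontrivial.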