Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

Kavli Affiliate: Ting Xu

| First 5 Authors: Chuanyang Zheng

| Summary:

Mixture-of-Experts (MoE) has become a cornerstone of recent state-of-the-art
large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$
as the router score function to aggregate expert outputs, a design choice that
has persisted from the earliest MoE models to modern LLMs and is now widely
regarded as standard practice. However, the necessity of using
$\mathrm{Softmax}$ to project router weights onto a probability simplex remains
an unchallenged assumption rather than a principled design choice. In this
work, we first revisit classical Nadaraya-Watson regression and observe
that MoE shares the same mathematical formulation as Nadaraya-Watson
regression. Furthermore, we show that both the feed-forward neural network (FFN)
and MoE can be interpreted as special cases of Nadaraya-Watson regression,
where the kernel function corresponds to the input neurons of the output layer.
Motivated by these insights, we propose the **zero-additional-cost**
Kernel Inspired Router with Normalization (KERN), an FFN-style router function,
as an alternative to $\mathrm{Softmax}$. We demonstrate that this router
generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers.
**Based on empirical observations and established practices in FFN
implementation, we recommend the use of $\mathrm{ReLU}$ activation and
$\ell_2$-normalization in the $\mathrm{KERN}$ router function.** Comprehensive
experiments on MoE and LLMs validate the effectiveness of the proposed FFN-style
router function $\mathrm{KERN}$.
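
For intuition, Nadaraya-Watson regression predicts $\hat{y}(x) = \sum_i \frac{K(x, x_i)}{\sum_j K(x, x_j)}\, y_i$; replacing the normalized kernel weights with router scores over expert outputs recovers the MoE form. Below is a minimal, hypothetical sketch, not the paper's reference implementation, contrasting a standard $\mathrm{Softmax}$ router with a KERN-style router built from the $\mathrm{ReLU}$ activation and $\ell_2$-normalization recommended in the abstract; the function names and PyTorch setup are assumptions made for illustration.

```python
# Illustrative sketch only: KERN-style routing (ReLU + l2-normalization) next to
# standard Softmax routing. Names here are hypothetical, not from the paper.
import torch
import torch.nn.functional as F

def softmax_router(logits: torch.Tensor) -> torch.Tensor:
    # Standard MoE routing: project router logits onto the probability simplex.
    return torch.softmax(logits, dim=-1)

def kern_router(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # FFN-style routing as sketched in the abstract: apply a ReLU activation,
    # then l2-normalize the resulting scores across the expert dimension.
    scores = F.relu(logits)
    return scores / (scores.norm(p=2, dim=-1, keepdim=True) + eps)

# Toy usage: route a batch of 2 tokens over 4 experts.
logits = torch.randn(2, 4)
print(softmax_router(logits))  # rows sum to 1
print(kern_router(logits))     # rows have ~unit l2 norm; ReLU may zero out experts
```

One practical difference this sketch highlights is that the ReLU can drive some expert scores exactly to zero, whereas $\mathrm{Softmax}$ always assigns every expert a strictly positive weight.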

| Search Query: ArXiv Query: search_query=au:”Ting Xu”&id_list=&start=0&max_results=3

Read More