A Neural Scaling Law from Lottery Ticket Ensembling

Kavli Affiliate: Max Tegmark

| First 5 Authors: Ziming Liu, Max Tegmark

| Summary:

Neural scaling laws (NSL) refer to the phenomenon where model performance
improves with scale. Sharma & Kaplan analyzed NSL using approximation theory
and predicted that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is
the number of model parameters and $d$ is the intrinsic input dimension.
Although their theory works well for some cases (e.g., ReLU networks), we
surprisingly find that a simple 1D problem $y=x^2$ manifests a different
scaling law ($\alpha=1$) from their predictions ($\alpha=4$). We opened the
neural networks and found that the new scaling law originates from lottery
ticket ensembling: a wider network on average has more "lottery tickets", which
are ensembled to reduce the variance of outputs. We support the ensembling
mechanism by mechanistically interpreting single neural networks, as well as
studying them statistically. We attribute the $N^{-1}$ scaling law to the
"central limit theorem" of lottery tickets. Finally, we discuss its potential
implications for large language models and statistical physics-type theories of
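
The "central limit theorem" argument in the summary can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes each "lottery ticket" acts as an independent, unbiased estimator of $y=x^2$ with Gaussian noise, and that the number of tickets grows in proportion to network width. The function names (`ensemble_mse`, `fit_exponent`) and the noise level `sigma` are hypothetical choices made for illustration. Under these assumptions, averaging $n$ tickets should give an MSE that decays roughly as $n^{-1}$.

```python
import math
import random


def ensemble_mse(n_tickets, n_trials=2000, sigma=1.0, seed=0):
    """MSE of averaging n_tickets independent noisy estimates of y = x^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0)
        target = x * x
        # Assumption: each hypothetical "ticket" returns the target plus
        # independent zero-mean Gaussian noise; the ensemble is their mean.
        estimate = sum(
            target + rng.gauss(0.0, sigma) for _ in range(n_tickets)
        ) / n_tickets
        total += (estimate - target) ** 2
    return total / n_trials


def fit_exponent(ns, losses):
    """Least-squares slope of log(loss) vs log(n); returns alpha in loss ~ n^(-alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(loss) for loss in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return -slope


ticket_counts = [1, 2, 4, 8, 16, 32, 64]
losses = [ensemble_mse(n) for n in ticket_counts]
alpha = fit_exponent(ticket_counts, losses)
print(f"fitted alpha = {alpha:.2f}")
```

The fitted exponent should land close to 1, matching the variance-reduction rate expected from averaging independent estimators. It says nothing about how real networks form or count lottery tickets, which is the subject of the paper itself.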

