Kavli Affiliate: Max Tegmark
| First 5 Authors: Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
| Summary:
We propose the Quantization Model of neural scaling laws, explaining both the
observed power law dropoff of loss with model and data size and the
sudden emergence of new capabilities with scale. We derive this model from what
we call the Quantization Hypothesis, where network knowledge and skills are
"quantized" into discrete chunks ($textbf{quanta}$). We show that when quanta
are learned in order of decreasing use frequency, then a power law in use
frequencies explains observed power law scaling of loss. We validate this
prediction on toy datasets, then study how scaling curves decompose for large
language models. Using language model gradients, we automatically decompose
model behavior into a diverse set of skills (quanta). We tentatively find that
the frequency at which these quanta are used in the training distribution
roughly follows a power law corresponding with the empirical scaling exponent
for language models, a prediction of our theory.
| Search Query: ArXiv Query: search_query=au:"Max Tegmark"&id_list=&start=0&max_results=3
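
As a quick illustration of the counting argument in the summary above, the following sketch simulates the core claim of the Quantization Hypothesis: if quanta use frequencies follow a power law and quanta are learned in order of decreasing use frequency, the residual loss falls off as a power law in the number of quanta learned. This is a minimal toy sketch, not the authors' code; the exponent `alpha` and the cutoff `K` on the number of quanta are assumed, illustrative values.

```python
import numpy as np

# Illustrative sketch only (not the paper's code): quanta k = 1, 2, ... are
# used with power-law frequencies p_k ∝ k^{-(alpha + 1)}, and a model learns
# them in order of decreasing use frequency. A model that has learned the n
# most frequent quanta incurs loss only on the unlearned tail, so its loss
# falls off roughly as n^{-alpha}.

alpha = 0.5                       # assumed scaling exponent (illustrative)
K = 1_000_000                     # toy cutoff on the number of quanta
k = np.arange(1, K + 1)
p = k ** -(alpha + 1.0)           # Zipf-like quanta use frequencies
p /= p.sum()                      # normalize to a probability distribution

cum = np.cumsum(p)                # frequency mass of the n most frequent quanta

def residual_loss(n: int) -> float:
    """Frequency mass of quanta not yet learned by an n-quanta model."""
    return float(1.0 - cum[n - 1])

for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6d}   residual loss = {residual_loss(n):.4e}   "
          f"n^-alpha = {n ** -alpha:.4e}")
# The two printed columns differ only by a roughly constant factor, i.e. the
# toy model's loss scales as a power law L(n) ∝ n^{-alpha}.
```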