The Remarkable Robustness of LLMs: Stages of Inference?

Kavli Affiliate: Max Tegmark

| First 5 Authors: Vedang Lad, Jin Hwa Lee, Wes Gurnee, Max Tegmark

| Summary:

We investigate the robustness of Large Language Models (LLMs) to structural
interventions by deleting and swapping adjacent layers during inference.
Surprisingly, models retain 72-95% of their original top-1 prediction accuracy
without any fine-tuning. We find that performance degradation is not uniform
across layers: interventions to the early and final layers cause the most
degradation, while the model is remarkably robust to dropping middle layers.
This pattern of localized sensitivity motivates our hypothesis of four stages
of inference, observed across diverse model families and sizes: (1)
detokenization, where local context is integrated to lift raw token embeddings
into higher-level representations; (2) feature engineering, where task- and
entity-specific features are iteratively refined; (3) prediction ensembling,
where hidden states are aggregated into plausible next-token predictions; and
(4) residual sharpening, where irrelevant features are suppressed to finalize
the output distribution. Synthesizing behavioral and mechanistic evidence, we
provide a framework for interpreting depth-dependent computations in LLMs.
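A minimal sketch of the layer-deletion and adjacent-layer-swap interventions described in the summary, using GPT-2 from Hugging Face as a stand-in model. The model choice, prompt, and top-1-agreement measurement are illustrative assumptions, not the paper's exact experimental setup.

```python
# Hedged sketch: ablate or swap transformer layers and measure how often the
# intervened model's top-1 next-token predictions agree with the original model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
original_blocks = list(model.transformer.h)  # the stack of transformer layers

# Illustrative prompt; the paper evaluates on full benchmark datasets.
text = "The quick brown fox jumps over the lazy dog because it"
inputs = tokenizer(text, return_tensors="pt")

def top1_predictions(m):
    # Greedy next-token prediction at every position, no fine-tuning involved.
    with torch.no_grad():
        logits = m(**inputs).logits
    return logits.argmax(dim=-1)

def set_blocks(blocks):
    # Replace the model's layer stack with a modified copy.
    model.transformer.h = torch.nn.ModuleList(blocks)

baseline = top1_predictions(model)
n = len(original_blocks)

for i in range(n):
    # Deletion: run inference with layer i removed entirely.
    set_blocks([b for j, b in enumerate(original_blocks) if j != i])
    agree_del = (top1_predictions(model) == baseline).float().mean().item()

    # Swap: exchange adjacent layers i and i+1 (skip the final layer).
    agree_swap = float("nan")
    if i < n - 1:
        swapped = list(original_blocks)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        set_blocks(swapped)
        agree_swap = (top1_predictions(model) == baseline).float().mean().item()

    print(f"layer {i:2d}: delete agreement={agree_del:.2f}, "
          f"swap agreement={agree_swap:.2f}")

set_blocks(original_blocks)  # restore the unmodified model
```

Run over a proper evaluation set, a loop like this would surface the depth dependence the paper reports: agreement drops sharply when the first or last layers are perturbed, but stays high for interventions on middle layers.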

| Search Query: ArXiv Query: search_query=au:"Max Tegmark"&id_list=&start=0&max_results=3