Kavli Affiliate: Max Tegmark | First 5 Authors: Joshua Engels, Logan Riggs, Max Tegmark | Summary: Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in "dark matter": unexplained variance in activations. This work investigates dark […]
Continue reading: Decomposing The Dark Matter of Sparse Autoencoders