Kavli Affiliate: Max Tegmark | First 5 Authors: Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan | Summary: Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents […]
Continue.. Dense SAE Latents Are Features, Not Bugs