An autoencoder where the bottleneck is English
The architectural twist is that both halves of the autoencoder are language models. An Activation Verbalizer takes a high-dimensional residual-stream activation from the target Claude model and produces a free-form text description; an Activation Reconstructor reads only that text and tries to rebuild the original activation. The two are trained jointly with reinforcement learning, with explanation quality measured by how accurately the reconstructor recovers the activation from text alone. This is structurally different from sparse autoencoders or attribution graphs, whose outputs still need expert interpretation: here the latent code is plain English by construction.

Trained NLAs reach a fraction of variance explained of roughly 0.6 to 0.8 on residual-stream activations, and Anthropic released open-weight versions for Qwen 2.5 7B, Gemma 3 12B and 27B, and Llama 3.3 70B, with an interactive frontend hosted in collaboration with Neuronpedia. Hacker News commenter comex summarized it cleanly: training forces the verbalizer to develop a mapping from activations to tokens that the reconstructor can then invert.
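To make the reconstruction-based reward concrete, here is a toy sketch rather than Anthropic's implementation: `verbalize` and `reconstruct` are crude stand-ins for the two language models (a top-k coordinate description instead of free-form English), and `fve` computes the fraction-of-variance-explained score that rewards the pair for how much of the activation survives the round trip through text. All function names and the "description language" here are hypothetical.

```python
import numpy as np

def fve(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Fraction of variance explained: the reward that scores how well the
    reconstructor recovers the activation from text alone (1.0 = perfect)."""
    residual = ((original - reconstructed) ** 2).sum()
    total = ((original - original.mean()) ** 2).sum()
    return 1.0 - residual / total

def verbalize(activation: np.ndarray, k: int = 8) -> str:
    """Toy Activation Verbalizer: turn an activation into a plain-text description.
    Here the 'language' is just the k largest coordinates and their values; the
    real verbalizer is a language model producing a free-form explanation."""
    top = np.argsort(-np.abs(activation))[:k]
    return ", ".join(f"dim {i} is about {activation[i]:.2f}" for i in top)

def reconstruct(description: str, dim: int) -> np.ndarray:
    """Toy Activation Reconstructor: rebuild the activation from the text alone."""
    estimate = np.zeros(dim)
    for clause in description.split(", "):
        words = clause.split()  # e.g. ["dim", "371", "is", "about", "-1.42"]
        estimate[int(words[1])] = float(words[-1])
    return estimate

rng = np.random.default_rng(0)
activation = rng.normal(size=4096)        # stand-in for a residual-stream activation
text = verbalize(activation)              # the bottleneck is this English string
rebuilt = reconstruct(text, activation.size)
print(f"reward (FVE): {fve(activation, rebuilt):.3f}")
```

In the real system both functions are trained language models and the FVE-style reward is the reinforcement-learning signal pushing the verbalizer toward descriptions the reconstructor can actually invert.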


