Language Model Insights
Neel discusses methods for understanding language model neurons, emphasizing that LLMs can generate explanations even in the absence of clear patterns. He highlights causal interventions, such as manipulating specific latent variables, and shares insights on the challenges of unlearning. The conversation also covers the potential of algorithmically classifying model outputs based on their properties, revealing the complexities of language model behavior.
From this podcast

Machine Learning Street Talk (MLST)
Neel Nanda - Mechanistic Interpretability (Sparse Autoencoders)