Adversarial Attacks Insights
Sparse autoencoders (SAEs) show potential for adversarial attacks, but current evidence suggests they may not outperform fine-tuning methods. Recent work from companies like Anthropic and OpenAI has reported scaling laws for SAEs, alongside improved training techniques and hyperparameter optimization. Research on these models is ongoing, and further insights and applications are expected.
From this podcast

Machine Learning Street Talk (MLST)
Neel Nanda - Mechanistic Interpretability (Sparse Autoencoders)