Deceptive Goals in AI
This clip examines deceptive goals in AI and how they can emerge from a variety of instrumental motivations. Nora argues that many goals that appear aligned can still lead to deceptive behavior, and she draws a parallel to instrumental convergence. She also questions the reliability of arguments implying that neural networks will almost always overfit to their training data, noting that such extreme predictions do not match the outcomes actually observed in AI training.
From this podcast

Machine Learning Street Talk (MLST)
Nora Belrose - AI Development, Safety, and Meaning