Delayed Generalization
Grokking reveals a fascinating phenomenon where a network's test performance can improve significantly after a prolonged training period, even when training metrics plateau. This suggests that during training, the network retains gradient information that eventually allows it to better generalize on unseen data. As training progresses, the network shifts from learning simple features to more complex ones, challenging the conventional understanding of how learning dynamics operate.In this clip
From this podcast

Machine Learning Street Talk (MLST)
Want to Understand Neural Networks? Think Elastic Origami! - Prof. Randall Balestriero
Related Questions