Distillation Through Attention

Hugo discusses the challenges of matching features between convnets and transformers for distillation. Exploring the idea of distilling attention matrices and intermediate representations, Hugo raises the question of how strongly the teacher should constrain the student in this process.
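
To make the idea of distilling attention matrices concrete, here is a minimal sketch rather than the method discussed in the talk. It assumes the teacher and the student both expose raw attention scores of the same shape (for instance, a transformer teacher), which is exactly what is hard to guarantee with a convnet teacher. The function name `attention_distillation_loss` and the choice of a row-wise KL divergence are illustrative assumptions, not taken from the source.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distillation_loss(teacher_scores, student_scores, eps=1e-12):
    """Mean KL(teacher || student) over query rows.

    Hypothetical helper: both inputs are lists of rows of raw
    (pre-softmax) attention scores with identical shapes.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must have the same number of rows")
    total = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        if len(t_row) != len(s_row):
            raise ValueError("teacher and student rows must have the same length")
        p = softmax(t_row)  # teacher attention distribution for this query
        q = softmax(s_row)  # student attention distribution for this query
        total += sum(
            pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q)
        )
    return total / len(teacher_scores)


# Usage: a student with uniform attention versus a peaked teacher.
teacher = [[2.0, 0.0], [0.0, 2.0]]
student = [[0.0, 0.0], [0.0, 0.0]]
loss = attention_distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's attention exactly and grows as the two distributions diverge. One way to express how hard the teacher pushes the student is to scale this term by a weight before adding it to the task loss; that weight would be a tuning choice, not something the source specifies.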