Audio-Visual Learning
Ishan discusses how aligning audio and visual cues in videos can yield powerful visual features, making it possible to tell apart visually similar actions, such as cutting an onion versus cutting an apple, from how they sound. With an improved contrastive loss, the model can better capture visual similarity and groupings in videos.
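To make the idea concrete, here is a minimal sketch of an audio-visual contrastive objective. It is not the method discussed in the episode: it assumes a generic InfoNCE-style loss in which each video clip's embedding should match the embedding of its own audio track and mismatch the others. The function names, the toy 2-D embeddings, and the temperature value are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def audio_visual_contrastive_loss(video_embs, audio_embs, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the episode's exact loss).

    The i-th video embedding is treated as a positive pair with the i-th
    audio embedding; every other audio embedding in the batch is a negative.
    """
    video = [l2_normalize(v) for v in video_embs]
    audio = [l2_normalize(a) for a in audio_embs]

    total = 0.0
    for i, v in enumerate(video):
        # Cosine similarity between this clip and every audio track, scaled by temperature.
        logits = [sum(x * y for x, y in zip(v, a)) / temperature for a in audio]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching audio track as the correct class.
        total += log_denominator - logits[i]
    return total / len(video)


# Toy embeddings: two clips whose audio either lines up with the video or is swapped.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[0.9, 0.1], [0.1, 0.9]]
swapped_audio = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = audio_visual_contrastive_loss(video, aligned_audio)
swapped_loss = audio_visual_contrastive_loss(video, swapped_audio)
print(aligned_loss < swapped_loss)
```

The loss is small when each clip's audio embedding sits close to its own video embedding and large when the pairs are swapped, which is the pressure that pushes the visual encoder to separate actions that look alike but sound different.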
From this podcast

Machine Learning Street Talk (MLST)
#55 Self-Supervised Vision Models (Dr. Ishan Misra - FAIR).