Model Training Insights

Akshita discusses the complexities of training larger models, highlighting the impact of weight tying on loss curves. She notes that sharing weights between the embedding and output layers works well for smaller models but presents challenges at the seven-billion-parameter scale. The findings suggest that different model sizes require different approaches to achieve stable training.
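
To make the idea of sharing weights between the embedding and output layers concrete, here is a minimal sketch in plain Python. It is an illustration only, not the setup used in the talk: the `TinyLM` class, its sizes, and the initialization scale are all hypothetical, and the "model" has nothing between the embedding lookup and the output projection.

```python
import random


class TinyLM:
    """Hypothetical toy model: an embedding table plus an output projection."""

    def __init__(self, vocab_size, d_model, tie_weights, seed=0):
        rng = random.Random(seed)
        # Embedding table: one d_model-sized vector per token id.
        self.embedding = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)
        ]
        if tie_weights:
            # Tied: the output projection is the very same matrix object,
            # so any update to one is an update to the other.
            self.output = self.embedding
        else:
            # Untied: the output projection gets its own parameters.
            self.output = [
                [rng.gauss(0.0, 0.02) for _ in range(d_model)]
                for _ in range(vocab_size)
            ]

    def logits(self, token_id):
        # Look up the token vector, then score it against every output row.
        hidden = self.embedding[token_id]
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.output]

    def num_parameters(self):
        # Count the shared matrix only once when the weights are tied.
        matrices = [self.embedding]
        if self.output is not self.embedding:
            matrices.append(self.output)
        return sum(len(m) * len(m[0]) for m in matrices)


tied = TinyLM(vocab_size=8, d_model=4, tie_weights=True)
untied = TinyLM(vocab_size=8, d_model=4, tie_weights=False)
print(tied.num_parameters(), untied.num_parameters())  # prints "32 64"
```

Tying halves the parameters spent on these two matrices, but it also forces a single matrix to serve both as input representation and as output classifier. The sketch shows only the mechanics of sharing; it does not reproduce the training behavior described above.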