Video Model Challenges

Pre-training video generation models is highly GPU intensive and typically requires data-center hardware such as H100 GPUs. Some capabilities, such as walking motion, only emerge at larger parameter scales. Models like Mochi 1 strike a balance by being powerful yet accessible enough to run on consumer-grade GPUs. Video generation also involves long sequence lengths, so the computational demands grow as the sequence gets longer, much as they do when a language model generates more tokens.
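As a rough illustration of the sequence-length point, the sketch below counts tokens for a clip and the number of token pairs that full self-attention would compare. The patch size, resolution, and frame counts are hypothetical values chosen for the example, not the settings of Mochi 1 or any other model, and the function names are made up for this sketch.

```python
def video_token_count(frames, height, width, temporal_stride=1, spatial_patch=16):
    """Count tokens, assuming one token per (temporal_stride frames) x
    (spatial_patch x spatial_patch pixels) block. All defaults are
    illustrative assumptions, not values from a real model."""
    t = frames // temporal_stride
    h = height // spatial_patch
    w = width // spatial_patch
    return t * h * w


def attention_pair_count(num_tokens):
    """Full self-attention compares every token with every token."""
    return num_tokens * num_tokens


# Hypothetical clips: same resolution, the second one twice as long.
short_clip = video_token_count(frames=16, height=256, width=256)
long_clip = video_token_count(frames=32, height=256, width=256)

print("tokens (short clip):", short_clip)
print("tokens (long clip): ", long_clip)
print("attention cost ratio:",
      attention_pair_count(long_clip) / attention_pair_count(short_clip))
```

Under these assumptions, doubling the clip length doubles the token count and quadruples the attention pairs, which is one reason longer videos become expensive quickly.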