Latent Action Models

The discussion dives into the intricacies of learning a latent action model that operates on raw pixels, allowing for effective interaction with environments. By utilizing a video tokenizer to discretize frames, the model predicts tokens that facilitate next-frame predictions. The decision to limit the action space to eight enhances usability and maintains consistency across successive frames.