Training involves encoding images and text into sequences of numbers, where the model predicts image outputs based on given prompts. The process is counterintuitive, as it aims to generate a specific image from a multitude of possibilities, relying on previous patches and learned associations to refine predictions. The challenge of predicting colors is addressed through cross-entropy loss, ensuring that the model does not default to simpler outputs like gray images.