Jürgen explains a novel approach to reinforcement learning in which the desired reward is given to the network as a command, and the network learns to produce action sequences that achieve it. Commands are adjusted based on previous outcomes, so the network is gradually steered toward actions that yield maximum reward. This makes exploration of the reward space structured rather than blind, and because the network is trained on outcomes it has actually observed, it can generalize and improve over time using ordinary supervised learning.
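To make the idea concrete, here is a minimal toy sketch of reward-as-command learning. It is not Jürgen's actual method or code. It assumes a hypothetical one-step environment (`ACTION_REWARDS`), and it uses a table of counts as a stand-in for the network. The names `act` and `train`, the epsilon-style exploration, and the rule "command the best reward seen so far" are all illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical one-step environment: each action yields a fixed reward.
ACTION_REWARDS = [0.0, 1.0, 3.0, 2.0]


def act(counts, command, rng, epsilon):
    """Return the action most often observed to produce the commanded reward.

    Falls back to a random action when the command has never been achieved
    before, or with probability epsilon (exploration).
    """
    seen = counts.get(command)
    if not seen or rng.random() < epsilon:
        return rng.randrange(len(ACTION_REWARDS))
    return max(seen, key=seen.get)


def train(episodes=300, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # counts[reward][action]: tabular stand-in for the trained network.
    counts = defaultdict(lambda: defaultdict(int))
    command = 0.0  # initial command: ask for nothing in particular
    for _ in range(episodes):
        action = act(counts, command, rng, epsilon)
        reward = ACTION_REWARDS[action]
        # Supervised update: the reward actually obtained becomes the
        # input label, and the action taken becomes the target.
        counts[reward][action] += 1
        # Adjust the command from past outcomes: ask for the best reward seen.
        command = max(counts)
    return counts, command


if __name__ == "__main__":
    counts, command = train()
    best_action = act(counts, command, random.Random(1), epsilon=0.0)
    print("final command:", command, "-> action", best_action)
```

The update step never computes a value function or a policy gradient. It only records "this action produced this reward", which is a supervised mapping from outcome to action. Asking for a reward the table has already seen then amounts to looking up an action known to deliver it.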