The discussion highlights the importance of the Holistic Evaluation of Language Models (HELM) and the challenges that test-set contamination poses for evaluation. Because closed-source models do not disclose their training data, it is hard to rule out that benchmark items were seen during training, which can inflate performance metrics. The leap in performance from GPT-3.5 to GPT-4 on standardized tests, such as the bar exam, underscores the need for careful evaluation to ensure that models are genuinely being tested on unseen data.
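One common way to probe for contamination, when a reference corpus is available to compare against, is to measure n-gram overlap between a test item and that corpus. The sketch below is illustrative only: the function names `ngrams` and `overlap_fraction`, the whitespace tokenization, and the default `n=8` are assumptions made for this example, not something described in the discussion or taken from HELM. For closed-source models the training data is not available, so a check like this can only be run against public corpora as a proxy.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams that also occur in any corpus document.

    Hypothetical helper: a value near 1.0 suggests the item may appear
    verbatim in the corpus; a value near 0.0 suggests it does not.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    question = "the quick brown fox jumps over the lazy dog"
    corpus = ["intro text the quick brown fox jumps over the lazy dog and more"]
    print(overlap_fraction(question, corpus, n=3))
```

A high overlap score is only a signal that an item may have been seen, and a low score does not prove it was not, since paraphrased or translated copies of a test would evade exact n-gram matching.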