Evaluating LLMs

Rosanne highlights the challenges of relying on benchmarks like BIG-bench, emphasizing how quickly evaluation methods in AI are evolving. The conversation reveals a tension between transparency and the potential for bias, since closed evaluations can lack neutrality. Both speakers discuss the irony of new models claiming state-of-the-art status only to be surpassed moments later, which shows how difficult it remains to assess AI performance.