706: Large Language Model Leaderboards and Benchmarks — with Caterina Constantinescu

Topics covered
Dataset Issues
Caterina Constantinescu highlights dataset contamination as a central challenge in evaluating large language models (LLMs). Many state-of-the-art models are closed source, so it is unclear what data they were trained on [1]. That makes it hard to rule out that evaluation datasets contain material the models have already seen, which would inflate their scores. Jon Krohn adds that models like GPT-4, trained on vast amounts of internet data, may already have seen the answers to evaluation questions, which complicates assessing their true capabilities [1].
If the algorithm's been trained on everything on the Internet, probably the questions on any evaluation, and the answers are already in there even more.
---
Caterina emphasizes the need for transparency in model training data to ensure fair evaluations [2].
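One way to picture the contamination concern is a simple word n-gram overlap check between benchmark items and whatever training text is available for inspection. The episode does not describe this method. The sketch below is purely illustrative: the function names `ngrams` and `contamination_rate` and the default window of 8 words are assumptions. It also only works when the training corpus can be inspected at all, which is exactly what closed-source models prevent.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Flag benchmark items that share at least one n-gram with the corpus.

    Hypothetical helper for illustration: returns the fraction of flagged
    items and the flagged items themselves.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


if __name__ == "__main__":
    # Toy data, invented for this example only.
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    items = [
        "Q: the quick brown fox jumps over the lazy dog near what?",
        "What is the capital of France and why is it famous worldwide?",
    ]
    rate, flagged = contamination_rate(items, corpus, n=5)
    print(f"{rate:.0%} of items overlap with the corpus")
```

Exact n-gram matching misses paraphrased or translated leaks, so a low overlap rate from a check like this would not prove a benchmark is clean.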
Benchmark Evolution
Caterina explains that the rapid evolution of LLMs means benchmarks need continuous updating. A benchmark can quickly become obsolete once models are trained to excel on it, so ongoing refinement is needed to assess performance accurately [3]. Jon Krohn points out that as models improve, they may memorize benchmark solutions, making it crucial to develop new tests that reflect current capabilities [3].
There's this whole idea of there's probably never going to be a particular point in time where we can stop refining and updating these benchmarks.
---
The introduction of models like Llama 2, which surpass earlier benchmark results despite their smaller sizes, exemplifies the dynamic nature of LLM evaluation [4].
User Perception
Caterina discusses how user perceptions of LLM performance often diverge from standardized benchmark results. Benchmarks focus on metrics like accuracy, while users may value creativity and usability, which are harder to quantify [2]. This gap highlights the need for evaluations that consider real-world user experience alongside traditional metrics. Jon Krohn introduces Caterina as a key figure in data science, emphasizing her contributions to understanding these evaluation challenges [5].
Creativity is not something you typically see in these benchmarks. And how would you even begin to measure creativity?
---
Caterina's insights underscore the importance of aligning model evaluations with user expectations to enhance practical applications [2].
Related Episodes


784: Aligning Large Language Models — with Sinan Ozdemir
670: LLaMA: GPT-3 performance, 10x smaller — with Jon Krohn (@JonKrohnLearns)
797: Deep Learning Classics and Trends — with Dr. Rosanne Liu
SDS 549: Engineering Natural Language Models — with Lauren Zhu
767: Open-Source LLM Libraries and Techniques — with Dr. Sebastian Raschka
847: AI Engineering 101 — with Ed Donner
787: MLOps: The Job and The Key Tools — with Demetrios Brinkmann
747: Technical Intro to Transformers and LLMs — with Kirill Eremenko
661: Designing Machine Learning Systems — with Chip Huyen
695: NLP with Transformers — with Hugging Face's Lewis Tunstall