Published Aug 18, 2023

706: Large Language Model Leaderboards and Benchmarks — with Caterina Constantinescu

Caterina Constantinescu dives into the complexities of evaluating large language models, comparing innovative platforms like Chatbot Arena and HELM, and highlighting the importance of human feedback, benchmark diversity, and dataset integrity for fair model assessment.

Episode Highlights

Dataset Issues

Caterina Constantinescu highlights the challenge of dataset contamination when evaluating large language models (LLMs). She explains that many state-of-the-art models are closed-source, so it is unclear what data they were trained on 1. That uncertainty raises the concern that evaluation datasets include material the models have already seen, which would inflate performance results. Jon Krohn adds that models like GPT-4, trained on vast amounts of internet data, may already have seen the answers to evaluation questions, which makes it harder to assess their true capabilities 1.

If the algorithm's been trained on everything on the Internet, probably the questions on any evaluation, and the answers are already in there even more.

---

Caterina emphasizes the need for transparency in model training data to ensure fair evaluations 2.
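
One common way to probe for the contamination described above, when the training corpus (or a sample of it) is available, is to look for long word n-grams shared between benchmark items and training documents. The following is a minimal sketch under stated assumptions, not a method described in the episode: the function names, the 8-word window, and the sample texts are all hypothetical and chosen only for illustration.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Return (fraction of flagged items, list of flagged items).

    An item is flagged when it shares at least one word n-gram with any
    document in the corpus. The window size `n` is an illustrative choice.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


# Hypothetical data: one benchmark question also appears in a scraped forum post.
corpus = [
    "Forum post: what is the capital of France and why is it famous? Paris, obviously."
]
benchmark = [
    "What is the capital of France and why is it famous?",
    "Name the largest moon of Saturn and describe its atmosphere in detail.",
]

rate, flagged = contamination_rate(benchmark, corpus)
print(f"{rate:.0%} of benchmark items overlap with the corpus")
```

A check like this only works when the training data can be inspected, which is exactly what closed-source models prevent; it also misses paraphrased or translated copies of a question.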

Benchmark Evolution

The rapid evolution of LLMs necessitates continuous updates to benchmarks, as Caterina explains. She notes that benchmarks can quickly become obsolete if models are trained to excel on them, requiring ongoing refinement to accurately assess performance 3. Jon Krohn points out that as models improve, they may memorize benchmark solutions, making it crucial to develop new tests that reflect current capabilities 3.

There's this whole idea of there's probably never going to be a particular point in time where we can stop refining and updating these benchmarks.

---

The introduction of models like Llama 2, which surpass previous benchmark results despite their smaller size, exemplifies the dynamic nature of LLM evaluation 4.

User Perception

Caterina discusses how user perceptions of LLM performance often diverge from standardized benchmarks. She notes that while benchmarks focus on metrics like accuracy, users may value creativity and usability, which are harder to quantify 2. This gap highlights the need for evaluations that consider real-world user experiences alongside traditional metrics. Jon Krohn introduces Caterina as a key figure in data science, emphasizing her contributions to understanding these evaluation challenges 5.

Creativity is not something you typically see in these benchmarks. And how would you even begin to measure creativity?

---

Caterina's insights underscore the importance of aligning model evaluations with user expectations to enhance practical applications 2.
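
One way to make the gap between benchmark results and user perception concrete is to compare how the same models rank under each. The sketch below computes a Spearman rank correlation in plain Python. It is only an illustration, not something discussed in the episode: the model names, scores, and preference rates are invented, and the formula used assumes there are no tied scores.

```python
def rank(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank, where 1 is the highest score (no ties assumed)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def spearman_rho(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman rank correlation between two score tables over shared models."""
    names = sorted(set(a) & set(b))
    n = len(names)
    if n < 2:
        raise ValueError("need at least two models present in both tables")
    rank_a = rank({name: a[name] for name in names})
    rank_b = rank({name: b[name] for name in names})
    squared_diffs = sum((rank_a[name] - rank_b[name]) ** 2 for name in names)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical numbers: benchmark accuracy versus share of users preferring each model.
benchmark_accuracy = {"model_a": 0.82, "model_b": 0.78, "model_c": 0.71, "model_d": 0.65}
user_preference = {"model_a": 0.48, "model_b": 0.61, "model_c": 0.55, "model_d": 0.36}

rho = spearman_rho(benchmark_accuracy, user_preference)
print(f"Rank agreement between benchmark and users: {rho:.2f}")
```

A value near 1 would mean users and the benchmark order the models the same way; lower values indicate the kind of divergence described above.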

Related Episodes

784: Aligning Large Language Models — with Sinan Ozdemir
670: LLaMA: GPT-3 performance, 10x smaller — with Jon Krohn (@JonKrohnLearns)
678: StableLM: Open-source "ChatGPT"-like LLMs you can fit on one GPU — with @JonKrohnLearns
797: Deep Learning Classics and Trends — with Dr. Rosanne Liu
SDS 549: Engineering Natural Language Models — with Lauren Zhu
767: Open-Source LLM Libraries and Techniques — with Dr. Sebastian Raschka
801: Merged LLMs Are Smaller And More Capable — with Arcee AI's Mark McQuade and Charles Goddard
847: AI Engineering 101 — with Ed Donner
787: MLOps: The Job and The Key Tools — with Demetrios Brinkmann
788: Multi-Agent Systems: How Teams of LLMs Excel at Complex Tasks — with @JonKrohnLearns
785: Math, Quantum ML and Language Embeddings — with Dr. Luis Serrano (@SerranoAcademy)
747: Technical Intro to Transformers and LLMs — with Kirill Eremenko
661: Designing Machine Learning Systems — with Chip Huyen
694: CatBoost: Powerful, efficient ML for large tabular datasets — with Jon Krohn (@JonKrohnLearns)
695: NLP with Transformers — with Hugging Face's Lewis Tunstall

Dexa/Super Data Science: ML & AI Podcast with Jon Krohn
