MLCommons’ David Kanter, NVIDIA’s Daniel Galvez on Publicly Accessible Datasets - Ep. 167

Topics covered
Data Access
David Kanter and Daniel Galvez discuss the democratization of machine learning through public datasets. As explained in the episode, MLCommons aims to make machine learning accessible by providing large, open datasets such as the People's Speech dataset and the Multilingual Spoken Words Corpus 1. These datasets are meant to be durable and to evolve over time, much like a garden that needs tending 2.
We want to trim the flowers and prune them as appropriately, pull out the weeds and evolve it.
---
The conversation also notes that these resources are crucial for researchers worldwide, including those at major tech companies, to drive innovation and sustain progress in machine learning 3.
Dataset Features
The People's Speech dataset and the Multilingual Spoken Words Corpus stand out for their scope and accessibility. The episode highlights that the People's Speech dataset includes 30,000 hours of labeled audio and is released under a Creative Commons license that permits commercial use 4. The dataset is unusual in consisting mostly of spontaneous speech, which is harder to transcribe but gives a more authentic picture of how language is actually used 1.
The People's Speech is actually mostly spontaneous speech, which is a fairly new thing to have in a dataset.
---
The episode also emphasizes the importance of maintaining these datasets so that they remain relevant and useful for future research 2.
Language Diversity
The Multilingual Spoken Words Corpus advances language representation in AI by covering 50 languages, many of which are underrepresented in existing datasets. The episode notes that this corpus is the only open-source dataset for 46 of these languages, a major step forward for inclusivity in AI research 5. This diversity is crucial for developing more accurate and representative machine learning models 6.
There are a lot of languages that are widely spoken that are underrepresented, frankly.
---
The guests envision a future in which these datasets are widely adopted, driving innovation and surprising applications in the field 5.
Related Episodes


Glean Founders Talk AI-Powered Enterprise Search on NVIDIA Podcast - Ep. 190
NVIDIA Research's David Luebke on Intersection of Graphics, AI - Ep. 127
MosaicML's Naveen Rao on Making Custom LLMs More Accessible - Ep. 199
Demystifying AI with NVIDIA’s Will Ramey - Ep. 113
NVIDIA Chief Scientist Bill Dally on Where AI Goes Next - Ep. 62
NVIDIA’s Shalini De Mello Talks Self-Supervised AI, NeurIPS Successes - Ep. 140
Serkan Piantino’s Company Makes AI for Everyone - Ep. 106
Behind the Scenes at NeurIPS with NVIDIA and CalTech’s Anima Anandkumar - Ep. 131
NVIDIA’s Annamalai Chockalingam on the Rise of LLMs - Ep. 206