Published Apr 12, 2022

MLCommons’ David Kanter, NVIDIA’s Daniel Galvez on Publicly Accessible Datasets - Ep. 167

David Kanter and NVIDIA's Daniel Galvez delve into the democratization of machine learning through public datasets, highlighting innovations like the People's Speech and the Multilingual Spoken Words Corpus that advance AI research. They emphasize the importance of community collaboration and technological strides in reducing costs and enhancing speech recognition, crucial for accessible ML tools and improving global communication.

Episode Highlights

Data Access

David Kanter and Daniel Galvez discuss the democratization of machine learning through public datasets. They explain that MLCommons aims to make machine learning accessible by providing large, open datasets such as the People's Speech and the Multilingual Spoken Words Corpus [1]. These datasets are designed to be durable and to evolve over time, much like a garden that needs tending [2].

We want to trim the flowers and prune them as appropriately, pull out the weeds and evolve it.

---

These resources are also described as crucial for researchers worldwide, even those at major tech companies, to drive innovation and maintain progress in machine learning [3].

Dataset Features

The People's Speech dataset and the Multilingual Spoken Words Corpus are groundbreaking in their scope and accessibility. The episode highlights that the People's Speech dataset includes 30,000 hours of labeled audio and is released under a Creative Commons license that allows commercial use [4]. The dataset is unusual in consisting largely of spontaneous speech, which is harder to transcribe but offers a more authentic representation of how language is used [1].

The People's Speech is actually mostly spontaneous speech, which is a fairly new thing to have in a dataset.

---

The discussion also emphasizes the importance of maintaining these datasets so that they remain relevant and useful for future research [2].

Language Diversity

The Multilingual Spoken Words Corpus significantly advances language representation in AI by including 50 languages, many of which are underrepresented in existing datasets. The episode notes that this corpus is the only open-source dataset for 46 of these languages, marking a major step forward for inclusivity in AI research [5]. This diversity is crucial for developing more accurate and representative machine learning models [6].

There are a lot of languages that are widely spoken that are underrepresented, frankly.

---

The hope is for a future in which these datasets are widely adopted, driving innovation and surprising applications in the field [5].

Related Episodes

Glean Founders Talk AI-Powered Enterprise Search on NVIDIA Podcast - Ep. 190
NVIDIA’s Jim Fan Delves Into Large Language Models and Their Industry Impact - Ep. 204
NVIDIA Research's David Luebke on Intersection of Graphics, AI - Ep. 127
MosaicML's Naveen Rao on Making Custom LLMs More Accessible - Ep. 199
NVIDIA’s Clément Farabet on Orchestrating AI Training for Autonomous Vehicles - Ep. 175
Ep. 2: Where Deep Learning Goes Next - Bryan Catanzaro, NVIDIA Applied Deep Learning Research
Artem Cherkasov and Olexandr Isayev on Democratizing Drug Discovery with Deep Learning - Ep. 172
Demystifying AI with NVIDIA’s Will Ramey - Ep. 113
NVIDIA Chief Scientist Bill Dally on Where AI Goes Next - Ep. 62
Snowflake's Baris Gultekin on Unlocking the Value of Data With Large Language Models - Ep. 231
NVIDIA’s Shalini De Mello Talks Self-Supervised AI, NeurIPS Successes - Ep. 140
Serkan Piantino’s Company Makes AI for Everyone - Ep. 106
Behind the Scenes at NeurIPS with NVIDIA and CalTech’s Anima Anandkumar - Ep. 131
NVIDIA’s Annamalai Chockalingam on the Rise of LLMs - Ep. 206
Making "Iron Man" Interface Real: AI-Based Virtualitics Demystifies Data Science with VR - Ep. 92

Topics covered

Popular Clips

Speech Technology Evolution

Creative Commons Licenses

Language Diversity in AI

Voice Recognition Challenges

Speech Recognition Breakthrough

Data Centric AI

Data as a Garden

Democratizing Machine Learning

Cost-Effective Labeling

Open Data Initiatives

Building ML Communities

Spontaneous Speech Insights

Episode Highlights

Public Dataset Importance

Data Access

Dataset Features

Language Diversity

Community and Collaboration

Technological Innovations

Public Dataset Importance

David Kanter and Daniel Galvez discuss the democratization of machine learning through public datasets. They explore the significance of the People's Speech and Multilingual Spoken Words Corpus in advancing AI research and language representation.

Technological Innovations

David Kanter and Daniel Galvez discuss innovative methods to reduce the cost of labeling datasets and advancements in speech recognition technology. These developments are crucial for democratizing access to machine learning tools and enhancing global communication.
