Published Apr 12, 2022

MLCommons’ David Kanter, NVIDIA’s Daniel Galvez on Publicly Accessible Datasets - Ep. 167

David Kanter and NVIDIA's Daniel Galvez delve into the democratization of machine learning through public datasets, highlighting innovations like the People's Speech and the Multilingual Spoken Words Corpus that advance AI research. They emphasize the importance of community collaboration and technological strides in reducing costs and enhancing speech recognition, crucial for accessible ML tools and improving global communication.
Episode Highlights
The AI Podcast logo

Popular Clips

Episode Highlights

  • Data Access

    and discuss the democratization of machine learning through public datasets. explains that MLCommons aims to make machine learning accessible by providing large, open datasets like the People's Speech and the Multilingual Spoken Words Corpus 1. These datasets are designed to be durable and evolve over time, much like a garden that needs tending 2.

    We want to trim the flowers and prune them as appropriately, pull out the weeds and evolve it.

    ---

    adds that these resources are crucial for researchers worldwide, even those at major tech companies, to drive innovation and maintain progress in machine learning 3.

       

    Dataset Features

    The People's Speech dataset and the Multilingual Spoken Words Corpus are groundbreaking in their scope and accessibility. highlights that the People's Speech dataset includes 30,000 hours of labeled audio, allowing for commercial use under a Creative Commons license 4. This dataset is unique in its inclusion of spontaneous speech, which presents challenges in transcription but offers a more authentic representation of language use 1.

    The people with speech is actually mostly spontaneous speech, which is a fairly new thing to have in a dataset.

    ---

    emphasizes the importance of maintaining these datasets to ensure they remain relevant and useful for future research 2.

       

    Language Diversity

    The Multilingual Spoken Words Corpus significantly advances language representation in AI by including 50 languages, many of which are underrepresented in existing datasets. notes that this corpus is the only open-source dataset for 46 of these languages, marking a major step forward for inclusivity in AI research 5. This diversity is crucial for developing more accurate and representative machine learning models 6.

    There are a lot of languages that are widely spoken that are underrepresented, frankly.

    ---

    envisions a future where these datasets are widely adopted, driving innovation and surprising applications in the field 5.

Related Episodes