Published Dec 13, 2022

635: The Perils of Manually Labeling Data for Machine Learning Models — with Shayan Mohanty

Shayan Mohanty dives into the inefficiencies and biases of manual data labeling in machine learning, advocating for automated solutions to enhance accuracy and reduce labor dependency while discussing the role of the Chomsky hierarchy in efficient data management.

Super Data Science: ML & AI Podcast with Jon Krohn

Episode Highlights

  • Hand Labeling Issues

    Shayan Mohanty highlights the inefficiencies and biases inherent in hand labeling data for machine learning models. He explains that although hand labels are often treated as the "ground truth," they are fraught with inconsistencies unless multiple layers of human auditing are involved [1]. The process is not only time-consuming but also unreliable, because it depends heavily on fallible human judgment [2]. Shayan argues that this makes hand labeling a dangerous foundation for machine learning models [3].

    Our whole thing is that everything about supervised machine learning has, in one way or another, been built up on this idea of ground truth. And we think that's actually kind of dangerous.

    ---

    He advocates for more automated solutions that give users the right tools and reduce the dependency on human-labeled data [3].

       

  • Economic Impact

    The societal and economic impacts of hand labeling are significant, affecting labor markets and international dynamics. Shayan points out that the reliance on low-wage workers for data labeling creates a "ghost work" economy, in which workers in countries such as Bangladesh and various African nations are employed without growth opportunities [4]. This practice leads to a race to the bottom in wages as companies aim to minimize costs [5]. Shayan argues that this system not only exploits workers but also hinders the broader adoption of AI due to data bottlenecks [6].

    There's this book called Ghost Work, which I recommend folks read if they're interested in this topic, but it details this idea of a ghost worker, or really a second class citizen in the Internet age.

    ---

    He emphasizes the need for more sustainable and equitable approaches to data labeling that do not rely on exploitative labor practices [4].

       

  • Automation Benefits

    Transitioning to automated labeling processes offers a promising solution to the challenges posed by hand labeling. Shayan discusses how automation can reduce repetitive human labor and increase efficiency, allowing experts to focus on more meaningful tasks [7]. He introduces the concept of machine teaching, which shifts the focus from building the best models to improving the teaching process itself [8]. This approach not only speeds up data processing but also improves the quality of labeled data by using heuristics and probabilistic methods [9].

    Long story short, hand labeling is hopefully on its way out, and instead we'll have more sustainable processes.

    ---

    Shayan's company, Watchful, exemplifies this shift by providing tools that aid in the creation of labeling functions, making the process more accessible and efficient [8].
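
To make the idea of labeling functions more concrete, here is a minimal Python sketch of how a few hand-written heuristics might be combined into a probabilistic label. It is not Watchful's API or the method described in the episode: the function names, the "complaint" and "praise" classes, and the keyword rules are all hypothetical, and a plain normalized vote stands in for the more sophisticated probabilistic models that real weak-supervision tools use.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# A labeling function returns a class label, or None to abstain
# when its heuristic does not apply to the example.
LabelingFunction = Callable[[str], Optional[str]]


# Hypothetical heuristics for a toy "complaint" vs. "praise" task.
def lf_mentions_refund(text: str) -> Optional[str]:
    return "complaint" if "refund" in text.lower() else None


def lf_mentions_broken(text: str) -> Optional[str]:
    return "complaint" if "broken" in text.lower() else None


def lf_mentions_thanks(text: str) -> Optional[str]:
    return "praise" if "thank" in text.lower() else None


def probabilistic_label(
    text: str, lfs: List[LabelingFunction]
) -> Dict[str, float]:
    """Combine labeling-function votes into a label distribution.

    Each non-abstaining function casts one vote; the result maps each
    voted label to its share of the votes. An empty dict means every
    function abstained, so the example stays unlabeled.
    """
    votes = [label for label in (lf(text) for lf in lfs) if label is not None]
    if not votes:
        return {}
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}


LFS: List[LabelingFunction] = [
    lf_mentions_refund,
    lf_mentions_broken,
    lf_mentions_thanks,
]

if __name__ == "__main__":
    example = "Thanks, but it arrived broken and I want a refund."
    print(probabilistic_label(example, LFS))
```

In this sketch, a subject-matter expert writes or adjusts a handful of rules instead of labeling every example by hand, and the disagreement between rules shows up as uncertainty in the label rather than being hidden behind a single "ground truth" value.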
