635: The Perils of Manually Labeling Data for Machine Learning Models — with Shayan Mohanty

Episode Highlights
Hand Labeling Issues
Shayan Mohanty highlights the inefficiencies and biases inherent in hand labeling data for machine learning models. He explains that although hand-labeled data is often treated as "ground truth," it is fraught with inconsistencies unless multiple layers of human auditing are involved 1. The process is both time-consuming and unreliable because it depends heavily on fallible human judgment 2. Shayan argues that building machine learning models on hand-labeled data is dangerous because of these flaws 3.
Our whole thing is that everything about supervised machine learning has, in one way or another, been built up on this idea of ground truth. And we think that's actually kind of dangerous.
---
He advocates more automated solutions that give users the right tools and reduce dependence on hand-labeled data 3.
Economic Impact
The societal and economic impacts of hand labeling are significant, affecting labor markets and international dynamics. Shayan points out that the reliance on low-wage workers for data labeling creates a "ghost work" economy, in which workers in countries like Bangladesh and various African nations are employed without opportunities for growth 4. This practice leads to a race to the bottom in wages as companies try to minimize costs 5. Shayan argues that the system not only exploits workers but also slows the broader adoption of AI by creating data bottlenecks 6.
There's this book called Ghost Work, which I recommend folks read if they're interested in this topic, but it details this idea of a ghost worker, or really a second class citizen in the Internet age.
---
He emphasizes the need for more sustainable and equitable solutions to data labeling that do not rely on exploitative labor practices 4.
Automation Benefits
Transitioning to automated labeling processes offers a promising solution to the challenges posed by hand labeling. Shayan discusses how automation can reduce repetitive human labor and increase efficiency, allowing experts to focus on more meaningful tasks 7. He introduces the concept of machine teaching, which shifts the focus from building the best models to improving the teaching process itself 8. This approach not only speeds up data processing but also improves the quality of labeled data by using heuristics and probabilistic methods 9.
Long story short, hand labeling is hopefully on its way out, and instead we'll have more sustainable processes.
---
Shayan's company, Watchful, exemplifies this shift by providing tools that aid in the creation of labeling functions, making the process more accessible and efficient 8.
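To make the idea of labeling functions concrete, here is a minimal sketch in Python. It is not Watchful's API or the method described in the episode: the function names, the spam/ham task, and the heuristics are all hypothetical, and a simple majority vote stands in for the probabilistic methods mentioned above. Each labeling function encodes one heuristic and may abstain; the votes are then combined into a label plus a rough agreement score.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Heuristic: messages with a URL are likely spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text):
    """Heuristic: prize or winner language is likely spam."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_reply(text):
    """Heuristic: very short messages are likely ordinary replies."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def weak_label(text):
    """Combine labeling-function votes by majority.

    Returns (label, agreement), where agreement is the share of
    non-abstaining votes that match the winning label. Ties and
    all-abstain cases return ABSTAIN. Majority voting is an assumed
    simplification of a probabilistic label model.
    """
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN, ranked[0][1] / len(votes)  # tie: no clear winner
    label, count = ranked[0]
    return label, count / len(votes)


if __name__ == "__main__":
    for message in [
        "You are a winner! Claim your prize at http://example.com now",
        "See you at noon",
        "The quarterly report is attached for your review today",
    ]:
        print(message, "->", weak_label(message))
```

The point of the sketch is that an expert writes a handful of heuristics once instead of labeling every example by hand; examples where the heuristics disagree or abstain are the ones worth a human's attention.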
Related Episodes

661: Designing Machine Learning Systems — with Chip Huyen
SDS 613: Causal Machine Learning — with Emre Kiciman
649: Introduction to Machine Learning — with Kirill Eremenko and Hadelin de Ponteves
SDS 464: A.I. vs Machine Learning vs Deep Learning — with Jon Krohn
627: AutoML: Automated Machine Learning — with Erin LeDell
SDS 599: MLOps: Machine Learning Operations — with @Miki_ML
679: The A.I. and Machine Learning Landscape — with investor George Mathew
721: Quantum Machine Learning — with Dr. Amira Abbas
717: Overcoming Adversaries with A.I. for Cybersecurity — with Dr. Dan Shiebler
SDS 435: Scaling Up Machine Learning — with Erica Greene
SDS 583: The State of Natural Language Processing — with Rongyao Huang
SDS 489: Monetizing Machine Learning — with Vin Vashishta
SDS 439: Deep Learning for Machine Vision — with Deblina Bhattacharjee
SDS 539: Interpretable Machine Learning — with Serg Masís
647: Is Data Science Still Sexy? — with Tom Davenport