Published Jul 8, 2021

Roger & DJ — The Rise of Big Data and CA's COVID-19 Response

Roger Magoulas and DJ Patil dive into the transformative world of big data, discussing its evolution, the shift from Hadoop to Spark, and strategic community-driven innovations, while detailing California's data-centric approach to managing the COVID-19 crisis.
Episode Highlights

  • Hadoop Issues

    Hadoop's limitations became apparent as data storage needs evolved. One speaker recalls the challenges at eBay, where storing all user data was impractical, so 99.9% of it was erased 1. That inefficiency prompted a shift toward Hadoop, which brought issues of its own. DJ highlights the need for more sophisticated data models during California's COVID-19 response and stresses that one-size-fits-all models were inadequate 2.

    Every time we want to do something interesting, we have to go to the lords of the data warehouse and ask permission.

    ---

    The need for adaptable and scalable data solutions became evident, pushing the industry to seek alternatives.

       

  • Spark Shift

    The transition from Hadoop to Spark marked a significant evolution in data processing. Roger explains that Hadoop's limitations as a write engine necessitated a shift to Spark, which offered better analytics support and in-memory processing 3. Spark's integration with Python eased the shift further by making it accessible to a broader range of developers. Roger also discusses the role of NoSQL databases, noting their utility while emphasizing that structured data remains important for effective analytics 4.

    Spark was just better at that than Hadoop was.

    ---

    The transition to Spark represented a move towards more efficient and user-friendly data processing tools.

       

  • Open Source

    Open source communities have played a crucial role in advancing data infrastructure technologies. One speaker emphasizes the collaborative nature of these communities, where sharing skills and techniques leads to collective improvement 5. The other adds that open source initiatives like Hadoop and Spark have democratized access to powerful tools, enabling widespread participation in data science 6.

    The community owns this collectively.

    ---

    This collaborative spirit has been instrumental in driving innovation and making advanced data technologies accessible to a global audience.
