Published Jul 8, 2021

Roger & DJ — The Rise of Big Data and CA's COVID-19 Response

Roger Magoulas and DJ Patil dive into the transformative world of big data, discussing its evolution, the shift from Hadoop to Spark, and strategic community-driven innovations, while detailing California's data-centric approach to managing the COVID-19 crisis.

Episode Highlights

Hadoop Issues

The limitations of Hadoop became apparent as data storage needs evolved. The conversation recalls the challenges faced at eBay, where storing all user data was impractical, so 99.9% of it was erased. That inefficiency prompted a shift toward Hadoop, though Hadoop was not without its own issues. DJ highlights the need for more sophisticated data models during California's COVID-19 response, emphasizing the inadequacy of one-size-fits-all models.

Every time we want to do something interesting, we have to go to the lords of the data warehouse and ask permission.

---

The need for adaptable and scalable data solutions became evident, pushing the industry to seek alternatives.

Spark Shift

The transition from Hadoop to Spark marked a significant evolution in data processing. Roger explains that Hadoop's limitations as a write engine necessitated a shift to Spark, which offered better analytics support and in-memory processing. Spark's integration with Python further eased the shift by making it accessible to a broader range of developers. Roger also discusses the role of NoSQL databases, noting their utility but emphasizing the importance of structured data for effective analytics.

Spark was just better at that than Hadoop was.

---

The transition to Spark represented a move towards more efficient and user-friendly data processing tools.

Open Source

Open source communities have played a crucial role in advancing data infrastructure technologies. One speaker emphasizes the collaborative nature of these communities, where sharing skills and techniques leads to collective improvement. The other adds that open source initiatives like Hadoop and Spark have democratized access to powerful tools, enabling widespread participation in data science.

The community owns this collectively.

---

This collaborative spirit has been instrumental in driving innovation and making advanced data technologies accessible to a global audience.

Related Episodes

Daphne Koller — Digital Biology and the Next Epoch of Science
Richard Socher — The Challenges of Making ML Work in the Real World
Sean and Greg — Biology and ML for Drug Discovery
Jeff Hammerbacher — From data science to biomedicine
Alyssa Simpson Rochwerger — Responsible ML in the Real World
Robert Nishihara — The State of Distributed Computing in ML
Dave Rogenmoser & Saad Ansari on Growing & Maintaining Jasper AI
Accelerating drug discovery with AI: Insights from Isomorphic Labs
Angela & Danielle — Designing ML Models for Millions of Consumer Robots
D. Sculley — Technical Debt, Trade-offs, and Kaggle
Cade Metz — The Stories Behind the Rise of AI
Vicki Boykis — Machine Learning Across Industries
The Power of AI in Search with You.com's Richard Socher
Johannes Otterbach — Unlocking ML for Traditional Companies
Bharath Ramsundar of Deep Forest Sciences — Deep Learning for Molecules and Medicine Discovery

Big Data Evolution

Roger Magoulas and DJ Patil explore the evolution of big data and the role of data scientists in shaping modern data practices. They highlight the collaborative efforts that have driven the field forward, emphasizing the importance of community and innovation.

Data Infrastructure Tools

The discussion shifts to the limitations of Hadoop and the subsequent transition to Spark, highlighting the need for more efficient data processing tools. DJ Patil and Roger Magoulas explore the role of open source communities in advancing these technologies.
