SDS 523: Open-Source Analytical Computing (pandas, Apache Arrow) — with Wes McKinney

Episode Highlights
Origins
Wes McKinney shares the origins of the pandas library, highlighting its roots in his work at AQR Capital Management during the 2007 financial crisis. He explains how pandas was inspired by R's data frames, aiming to bring similar functionality to Python. The library has since evolved significantly, with a dedicated team of developers enhancing its features and performance.
I got a job in quantitative finance at AQR Capital Management... right as the financial crisis was beginning.
---
Wes emphasizes the importance of community contributions in the ongoing development of pandas, noting that he has not been the sole maintainer since 2013 [1, 2].
Community
The open-source community has played a crucial role in the evolution of pandas. Wes McKinney highlights the contributions of core developers and the importance of community-driven events like documentation hackathons. These efforts have made pandas more accessible and robust, fostering a vibrant ecosystem of contributors.
Pandas has become this essential glue between different types of systems.
---
Jon Krohn adds that pandas is now used by over half a million projects on GitHub, underscoring its widespread adoption and impact [3, 4].
Challenges
Scaling pandas to handle large datasets presents significant challenges. Wes McKinney discusses issues like memory use and performance bottlenecks, particularly when dealing with data sizes beyond a gigabyte. He mentions projects like Dask and Ibis that aim to address these limitations by enabling distributed computing and scale-out processing.
It's hard to make everybody happy in a tool like pandas.
---
Despite these challenges, the ecosystem continues to evolve, with ongoing efforts to improve pandas' scalability and efficiency [5, 3].
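To make the memory concern concrete, here is a minimal sketch of one common workaround in plain pandas: reading a file in chunks and combining partial aggregates, so that only one chunk is held in memory at a time. This is not taken from the episode. The sample data, the column names `symbol` and `price`, and the helper `chunked_sum_by_key` are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file too large to load at once.
csv_text = "symbol,price\n" + "\n".join(
    f"{'AAA' if i % 2 == 0 else 'BBB'},{i}" for i in range(10)
)


def chunked_sum_by_key(source, key, value, chunksize):
    """Sum `value` per `key`, holding only one chunk in memory at a time."""
    total = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        partial = chunk.groupby(key)[value].sum()
        # Keys missing from either side are treated as 0 when combining.
        total = partial if total is None else total.add(partial, fill_value=0)
    return total


totals = chunked_sum_by_key(io.StringIO(csv_text), "symbol", "price", chunksize=3)

# For comparison: load everything at once and measure the in-memory footprint.
full = pd.read_csv(io.StringIO(csv_text))
full_bytes = int(full.memory_usage(deep=True).sum())
```

Chunking only helps for aggregations that can be combined from partial results, such as sums and counts. Tools like Dask, mentioned above, aim to automate this kind of partitioned computation across cores or machines.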
Related Episodes


675: Pandas for Data Analysis and Visualization — with Stefanie Molin

SDS 557: Effective Pandas — with Matt Harrison

765: NumPy, SciPy and the Economics of Open-Source — with Dr. Travis Oliphant

SDS 587: Data Engineering for Data Scientists — with Mark Freeman

SDS 535: How to Found, Grow, and Sell a Data Science Start-up — with Austin Ogilvie

SDS 433: Data Science Trends for 2021 — with Ben Taylor

SDS 567: Open-Access Publishing — with Amy Brand

SDS 595: Data Engineering 101 — with Joe Reis and Matt Housley

SDS 537: Data Science Trends for 2022 — with Sadie St. Lawrence

SDS 571: Collaborative, No-Code Machine Learning — with Tim Kraska

SDS 511: Data Science for Private Investing — LIVE with Drew Conway

SDS 493: Bringing Data to the People — with Anjali Shrivastava

SDS 575: Optimizing Computer Hardware with Deep Learning — with Magnus Ekman