Published Nov 16, 2021

SDS 523: Open-Source Analytical Computing (pandas, Apache Arrow) — with Wes McKinney

Wes McKinney, creator of the pandas library, reflects on his journey in open-source analytics, highlighting the evolution of pandas and Apache Arrow and their transformative impact on data science. Discover insights into open-source development, data processing with Python, and how community-driven projects enhance scalability and innovation.

Episode Highlights

Origins

Wes McKinney shares the origins of the pandas library, highlighting its roots in his work at AQR Capital Management during the 2007 financial crisis. He explains how pandas was inspired by R's data frames, aiming to bring similar functionality to Python. The library has since evolved significantly, with a dedicated team of developers enhancing its features and performance.

I got a job in quantitative finance at AQR Capital Management... right as the financial crisis was beginning.

---

Wes emphasizes the importance of community contributions in the ongoing development of pandas, noting that he has not been the sole maintainer since 2013 [1, 2].

Community

The open-source community has played a crucial role in the evolution of pandas. Wes McKinney highlights the contributions of core developers and the importance of community-driven events like documentation hackathons. These efforts have made pandas more accessible and robust, fostering a vibrant ecosystem of contributors.

Pandas has become this essential glue between different types of systems.

---

Jon Krohn adds that pandas is now used by over half a million projects on GitHub, underscoring its widespread adoption and impact [3, 4].

Challenges

Scaling pandas to handle large datasets presents significant challenges. Wes McKinney discusses issues like memory use and performance bottlenecks, particularly when dealing with data sizes beyond a gigabyte. He mentions projects like Dask and Ibis that aim to address these limitations by enabling distributed computing and scale-out processing.

It's hard to make everybody happy in a tool like pandas.

---

Despite these challenges, the ecosystem continues to evolve, with ongoing efforts to improve pandas' scalability and efficiency [5, 3].
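
To make the memory concern concrete, here is a minimal sketch of two common pandas habits for larger data: checking how much memory a DataFrame occupies, and aggregating a file in chunks so that only a slice is held in memory at a time. This example is not from the episode. The tiny in-memory CSV and the column names `symbol` and `price` are hypothetical stand-ins for a file too large to load at once.

```python
import io

import pandas as pd

# Hypothetical data: a small in-memory CSV standing in for a much larger file.
csv_text = "symbol,price\n" + "\n".join(
    f"{'AAA' if i % 2 == 0 else 'BBB'},{i}" for i in range(10)
)

# Load everything at once and inspect the per-column memory footprint.
# deep=True also counts the Python string objects behind text columns.
full = pd.read_csv(io.StringIO(csv_text))
bytes_per_column = full.memory_usage(deep=True)

# Chunked alternative: read 4 rows at a time and keep only running totals,
# so the full table never needs to be in memory.
totals = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    for symbol, subtotal in chunk.groupby("symbol")["price"].sum().items():
        totals[symbol] = totals.get(symbol, 0) + int(subtotal)

print(bytes_per_column)
print(totals)
```

Chunking only helps when the computation can be expressed as running totals like this. Tools such as Dask take the same idea further by splitting work across partitions and machines.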

Related Episodes

675: Pandas for Data Analysis and Visualization — with Stefanie Molin
SDS 557: Effective Pandas — with Matt Harrison
765: NumPy, SciPy and the Economics of Open-Source — with Dr. Travis Oliphant
SDS 587: Data Engineering for Data Scientists — with Mark Freeman
SDS 535: How to Found, Grow, and Sell a Data Science Start-up — with Austin Ogilvie
SDS 433: Data Science Trends for 2021 — with Ben Taylor
SDS 567: Open-Access Publishing — with Amy Brand
SDS 581: Bayesian, Frequentist, and Fiducial Statistics in Data Science — with Xiao-Li Meng
SDS 593: The Real-World Impact of Cross-Disciplinary Data Science Collaboration — with Philip Bourne
SDS 595: Data Engineering 101 — with Joe Reis and Matt Housley
SDS 537: Data Science Trends for 2022 — with Sadie St. Lawrence
SDS 571: Collaborative, No-Code Machine Learning — with Tim Kraska
SDS 511: Data Science for Private Investing — LIVE with Drew Conway
SDS 493: Bringing Data to the People — with Anjali Shrivastava
SDS 575: Optimizing Computer Hardware with Deep Learning — with Magnus Ekman

Topics covered

Career and Personal Insights

Wes McKinney, the creator of the pandas library, shares his career journey, key decisions, and milestones in the world of open-source analytical computing.

Pandas Evolution

Wes McKinney discusses the history and evolution of the pandas library, emphasizing its origins, community contributions, and the challenges faced in scaling it for large datasets.

Origins

Community

Challenges

Data Science Tools

Wes McKinney explores the Python ecosystem, highlighting the development of pandas and the integration of advanced hardware for data processing. He also discusses the importance of digital note-taking and collaboration tools in modern data science workflows.

Open Source Development

Apache Arrow

Wes McKinney provides an in-depth look at Apache Arrow, detailing its origins, technical innovations, and the vibrant community supporting its development. He explains how Arrow aims to revolutionize data science by enhancing data efficiency and scalability.
