Published Nov 16, 2021

SDS 523: Open-Source Analytical Computing (pandas, Apache Arrow) — with Wes McKinney

Wes McKinney, creator of the pandas library, reflects on his journey in open-source analytics, highlighting the evolution of pandas and Apache Arrow and their transformative impact on data science. Discover insights into open-source development, data processing with Python, and how community-driven projects enhance scalability and innovation.
Episode Highlights


  • Origins

    Wes McKinney shares the origins of the pandas library, highlighting its roots in his work at AQR Capital Management during the 2007 financial crisis. He explains how pandas was inspired by R's data frames, aiming to bring similar functionality to Python. The library has since evolved significantly, with a dedicated team of developers enhancing its features and performance.

    I got a job in quantitative finance at AQR Capital Management... right as the financial crisis was beginning.

    ---

    Wes emphasizes the importance of community contributions in the ongoing development of pandas, noting that he has not been the sole maintainer since 2013 [1][2].

       

  • Community

    The open-source community has played a crucial role in the evolution of pandas. Wes McKinney highlights the contributions of core developers and the importance of community-driven events like documentation hackathons. These efforts have made pandas more accessible and robust, fostering a vibrant ecosystem of contributors.

    Pandas has become this essential glue between different types of systems.

    ---

    Jon Krohn adds that pandas is now used by over half a million projects on GitHub, underscoring its widespread adoption and impact [3][4].

       

  • Challenges

    Scaling pandas to handle large datasets presents significant challenges. Wes McKinney discusses issues like memory use and performance bottlenecks, particularly when dealing with data sizes beyond a gigabyte. He mentions projects like Dask and Ibis that aim to address these limitations by enabling distributed computing and scale-out processing.

    It's hard to make everybody happy in a tool like pandas.

    ---

    Despite these challenges, the ecosystem continues to evolve, with ongoing efforts to improve pandas' scalability and efficiency [5][3].
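
    To make the memory concern concrete, here is a minimal sketch of one common single-machine workaround: reading a CSV in chunks so that only one slice is held in memory at a time. This is not a technique attributed to Wes in the episode, and it is not how Dask or Ibis work internally. The helper name `chunked_group_sum`, the column names, and the tiny synthetic data are all hypothetical and chosen only for illustration.

```python
import io

import pandas as pd


def chunked_group_sum(csv_source, key, value, chunksize=1000):
    """Sum `value` per `key` while holding only one chunk in memory at a time.

    Hypothetical helper for illustration; not part of pandas.
    """
    totals = None
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        partial = chunk.groupby(key)[value].sum()
        # Merge into the running totals; fill_value=0 covers keys that are
        # missing from either the running totals or the current chunk.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals


# Tiny synthetic stand-in for a file too large to load at once (made-up data).
rows = ["symbol,volume"] + [f"{'AAA' if i % 2 else 'BBB'},{i}" for i in range(10)]
csv_text = "\n".join(rows)

# Aggregate three rows at a time instead of loading the whole file.
result = chunked_group_sum(
    io.StringIO(csv_text), key="symbol", value="volume", chunksize=3
)

# Reference answer computed the ordinary way, by loading everything at once.
full = pd.read_csv(io.StringIO(csv_text)).groupby("symbol")["volume"].sum()
```

    Chunking only helps for aggregations that can be combined from partial results, such as sums and counts. Operations that need the whole dataset at once are where scale-out tools like the ones mentioned above become relevant.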

Related Episodes