Published Jul 26, 2022

SDS 595: Data Engineering 101 — with Joe Reis and Matt Housley

Joe Reis and Matt Housley dive into the core principles of data engineering, sharing key insights from their book and discussing essential strategies for efficient data management, communication, and tool selection. With a focus on collaboration and best practices, they unravel the complexity of the data lifecycle.

Episode Highlights

  • Role & Definition

    Data engineering is the backbone of transforming raw data into high-quality, consistent information for downstream use cases like analysis, machine learning, and reporting. One guest emphasizes that data engineering involves developing, implementing, and maintaining systems that process raw data into usable formats 1. He criticizes past definitions that narrowly focused on specific technologies like Hadoop and Spark, arguing that true data engineering is about managing data flows and serving end users 2. The discussion adds that a data engineer's role is to make data useful for data scientists and other users 1.

    Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high quality, consistent information that supports downstream use cases.

    ---

    This redefined perspective aims to clarify the true essence of data engineering beyond mere tool usage.

       

  • Data Lifecycle

    Understanding the data engineering lifecycle is crucial for optimizing data processes. One guest explains that data engineers flip the traditional data science funnel by taking on tasks like data cleaning and munging, allowing data scientists to focus on modeling and analysis 3. He also highlights the overlap between data engineering and machine learning engineering, noting that ML engineers often pick up where data engineers leave off 4.

    Data engineers really serve the purpose of flipping the funnel on its head of what a data scientist is expected to do.

    ---

    This collaboration ensures that data is efficiently processed and ready for advanced analytics and machine learning tasks.

       

  • Critical Undercurrents

    Critical undercurrents like security, data management, and orchestration drive the data engineering lifecycle. One guest identifies security as a primary concern, followed by comprehensive data management practices 5. Orchestration is then compared to managing a subway system, where a central controller ensures that data processes run smoothly without collisions 5. Dependency management is another key aspect, ensuring that each step runs only after the steps it relies on have finished 6.

    Orchestration for data is the same thing. It's basically a switchyard manager that says, okay, first I need to ingest the data, then I need to post process it.

    ---

    These elements are essential for maintaining the integrity and efficiency of data pipelines.
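
The "switchyard manager" idea from the orchestration discussion can be sketched in a few lines of Python. This is only an illustration, not something described in the episode: the task names (`ingest`, `post_process`, `serve`) and the helper functions are hypothetical, and the ordering relies on the standard library's `graphlib.TopologicalSorter` (Python 3.9+) rather than a real orchestration tool.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "post_process": {"ingest"},
    "serve": {"post_process"},
}


def run_order(dag):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def run(dag, actions):
    """Call each task's action in dependency order and return that order."""
    order = run_order(dag)
    for name in order:
        actions[name]()
    return order


if __name__ == "__main__":
    log = []
    # Placeholder actions that only record which task ran.
    actions = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
    run(PIPELINE, actions)
    print(log)
```

Because each task here depends on exactly one predecessor, the only valid order is ingest, then post-process, then serve. A production orchestrator would add scheduling, retries, and monitoring on top of this dependency ordering.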
