Published Jul 26, 2022

SDS 595: Data Engineering 101 — with Joe Reis and Matt Housley

Joe Reis and Matt Housley dive into the core principles of data engineering, sharing key insights from their book and discussing essential strategies for efficient data management, communication, and tool selection. With a focus on collaboration and best practices, they unravel the complexity of the data lifecycle.

Episode Highlights

Role & Definition

Data engineering is the backbone of transforming raw data into high-quality, consistent information for downstream use cases like analysis, machine learning, and reporting. The episode emphasizes that data engineering involves developing, implementing, and maintaining systems that process raw data into usable formats [1]. It criticizes past definitions that focused narrowly on specific technologies like Hadoop and Spark, arguing that true data engineering is about managing data flows and serving end users [2]. It adds that a data engineer's role is to make data useful for data scientists and other users [1].

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high quality, consistent information that supports downstream use cases.

---

This redefined perspective aims to clarify the true essence of data engineering beyond mere tool usage.

Data Lifecycle

Understanding the data engineering lifecycle is crucial for optimizing data processes. The episode explains that data engineers flip the traditional data science funnel by taking on tasks like data cleaning and munging, allowing data scientists to focus on modeling and analysis [3]. It also highlights the overlap between data engineering and machine learning engineering, noting that ML engineers often pick up where data engineers leave off [4].

Data engineers really serve the purpose of flipping the funnel on its head of what a data scientist is expected to do.

---

This collaboration ensures that data is efficiently processed and ready for advanced analytics and machine learning tasks.
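
As a rough illustration of the cleaning and munging work described above, the sketch below normalizes a handful of raw records before they reach a data scientist. The field names (`user_id`, `signup_date`, `country`), the accepted date formats, and the `parse_date` and `clean_records` helpers are all assumptions made for this example; none of them come from the episode.

```python
from datetime import datetime

# Hypothetical date layouts that the raw source might contain.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_records(raw_records):
    """Drop unusable rows and duplicates; normalize the remaining fields."""
    seen_ids = set()
    cleaned = []
    for row in raw_records:
        user_id = str(row.get("user_id", "")).strip()
        signup = parse_date(str(row.get("signup_date", "")))
        if not user_id or signup is None or user_id in seen_ids:
            continue  # skip rows with missing IDs, bad dates, or repeated IDs
        seen_ids.add(user_id)
        cleaned.append({
            "user_id": user_id,
            "signup_date": signup,
            "country": str(row.get("country", "")).strip().upper() or None,
        })
    return cleaned


raw = [
    {"user_id": " 42 ", "signup_date": "2022-07-26", "country": "us"},
    {"user_id": "42", "signup_date": "07/26/2022", "country": "US"},  # duplicate ID
    {"user_id": "7", "signup_date": "not a date", "country": "ca"},   # bad date
    {"user_id": "9", "signup_date": "07/01/2022", "country": " de "},
]
cleaned = clean_records(raw)
print(cleaned)
```

Keeping the rules in one small function like this means the rows that get dropped are decided in a single, reviewable place rather than re-derived by every downstream user.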

Critical Undercurrents

Critical undercurrents like security, data management, and orchestration drive the data engineering lifecycle. Security is identified as a primary concern, followed by comprehensive data management practices [5]. Orchestration is compared to managing a subway system, where the goal is to keep data processes running smoothly without collisions [5]. Dependency management is another key aspect, ensuring that all data components work harmoniously together [6].

Orchestration for data is the same thing. It's basically a switchyard manager that says, okay, first I need to ingest the data, then I need to post process it.

---

These elements are essential for maintaining the integrity and efficiency of data pipelines.
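
The orchestration and dependency-management ideas above can be sketched in a few lines of Python. This is only an illustration, not something shown in the episode: the task names (`ingest`, `post_process`, `publish`) and the `run_pipeline` helper are hypothetical, and a real deployment would typically rely on a dedicated orchestrator. The sketch uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) so that each step runs only after the steps it depends on.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the run order.

    tasks: dict mapping a task name to a zero-argument callable.
    dependencies: dict mapping a task name to the set of tasks it waits on.
    """
    # Start from every known task so steps without dependencies still run.
    graph = {name: set() for name in tasks}
    for name, upstream in dependencies.items():
        graph[name] = set(upstream)

    order = []
    # static_order() yields a task only after its predecessors and raises
    # graphlib.CycleError if the dependencies form a loop.
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
tasks = {
    "ingest": lambda: log.append("ingested raw data"),
    "post_process": lambda: log.append("post-processed data"),
    "publish": lambda: log.append("published for downstream users"),
}
dependencies = {
    "post_process": {"ingest"},
    "publish": {"post_process"},
}
order = run_pipeline(tasks, dependencies)
print(order)
```

Declaring the dependencies as data, rather than hard-coding the call order, is what lets the "switchyard manager" detect a circular dependency before any step runs.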

Related Episodes

SDS 587: Data Engineering for Data Scientists — with Mark Freeman
SDS 657: How to Learn Data Engineering — with Andreas Kretz (@andreaskayy)
SDS 485: Financial Data Engineering — with Doug Eisenstein
SDS 619: Tools for Deploying Data Models into Production — with Erik Bernhardsson
SDS 615: How to Ace Your Data Science Interview — with Nick Singh
SDS 623: Data Analyst, Data Scientist, and Data Engineer Career Paths — with @ShashankData
SDS 561: Engineering Data APIs — with Nate Fox
SDS 581: Bayesian, Frequentist, and Fiducial Statistics in Data Science — with Xiao-Li Meng
SDS 605: Upskilling in Data Science and Machine Learning — with Kian Katanforoosh
SDS 433: Data Science Trends for 2021 — with Ben Taylor
SDS 487: Fixing Dirty Data — with Susan Walsh
SDS 517: Courses in Data Science and Machine Learning — with Sadie St. Lawrence
SDS 555: Sports Analytics and 66 Days of Data with @KenJee_ds
SDS 499: Data Meshes and Data Reliability — with Barr Moses
SDS 483: Setting Yourself Apart in Data Science Interviews — with Andrew Jones

Communication and Collaboration

Joe Reis and Matt Housley emphasize the importance of effective communication with downstream stakeholders in data engineering. They discuss how cross-functional collaboration can significantly improve data engineering outcomes.

Choosing Tools and Techniques

Joe Reis and Matt Housley discuss the critical factors in selecting data engineering tools and share their favorite tools and techniques. They also highlight best practices for efficient workflow and collaboration in data engineering.