SDS 595: Data Engineering 101 — with Joe Reis and Matt Housley

Topics covered
Role & Definition
Data engineering is the backbone of transforming raw data into high-quality, consistent information for downstream use cases like analysis, machine learning, and reporting. One of the guests emphasizes that data engineering involves developing, implementing, and maintaining systems that process raw data into usable formats [1]. He criticizes past definitions that focused narrowly on specific technologies such as Hadoop and Spark, arguing that data engineering is really about managing data flows and serving end users [2]. The discussion adds that a data engineer's role is to make data useful for data scientists and other users [1].
"Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high quality, consistent information that supports downstream use cases."
This redefined perspective aims to clarify the true essence of data engineering beyond mere tool usage.
Data Lifecycle
Understanding the data engineering lifecycle is crucial for optimizing data processes. One of the guests explains that data engineers flip the traditional data science funnel by taking on tasks like data cleaning and munging, which lets data scientists focus on modeling and analysis [3]. He also highlights the overlap between data engineering and machine learning engineering, noting that ML engineers often pick up where data engineers leave off [4].
"Data engineers really serve the purpose of flipping the funnel on its head of what a data scientist is expected to do."
This collaboration ensures that data is efficiently processed and ready for advanced analytics and machine learning tasks.
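To make "data cleaning and munging" concrete, here is a minimal sketch of the kind of work the episode describes data engineers taking off a data scientist's plate. The record layout (`email`, `signup_date`) and the function name `clean_records` are hypothetical and chosen only for illustration; they do not come from the episode.

```python
from datetime import date


def clean_records(raw_rows):
    """Normalize hypothetical raw signup rows into consistent records."""
    cleaned = []
    seen_emails = set()
    for row in raw_rows:
        # Normalize the email: trim whitespace and lowercase it.
        email = (row.get("email") or "").strip().lower()
        if not email or email in seen_emails:
            continue  # drop rows without an email and duplicate emails
        seen_emails.add(email)
        try:
            signup = date.fromisoformat((row.get("signup_date") or "").strip())
        except ValueError:
            signup = None  # keep the row, but mark the date as unknown
        cleaned.append({"email": email, "signup_date": signup})
    return cleaned


raw = [
    {"email": " Ada@Example.com ", "signup_date": "2022-07-26"},
    {"email": "ada@example.com", "signup_date": "2022-07-27"},  # duplicate
    {"email": None, "signup_date": "2022-07-28"},               # no email
    {"email": "bob@example.com", "signup_date": "not a date"},  # bad date
]
result = clean_records(raw)
```

Downstream users would then work with `result`, which has one row per email and a typed date, instead of repeating this cleanup in every analysis.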
Critical Undercurrents
Critical undercurrents like security, data management, and orchestration run through the whole data engineering lifecycle. One of the guests identifies security as a primary concern, followed by comprehensive data management practices [5]. The discussion then turns to orchestration, which is compared to managing a subway system so that data processes run smoothly without collisions [5]. Dependency management is another key aspect: it ensures that each data component runs only after the components it relies on [6].
"Orchestration for data is the same thing. It's basically a switchyard manager that says, okay, first I need to ingest the data, then I need to post process it."
These elements are essential for maintaining the integrity and efficiency of data pipelines.
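The "switchyard manager" idea of orchestration with dependency management can be sketched as a small dependency graph that is run in a valid order. This is only an illustration using Python's standard-library `graphlib` (Python 3.9+), not a tool named in the episode; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "post_process": {"ingest"},
    "build_report": {"post_process"},
    "train_model": {"post_process"},
}


def run_pipeline(deps, run_task):
    """Run every task after all of its dependencies; return the order used."""
    order = []
    # static_order() yields tasks so that dependencies always come first,
    # and raises graphlib.CycleError if the graph contains a cycle.
    for task in TopologicalSorter(deps).static_order():
        run_task(task)
        order.append(task)
    return order


# A no-op stands in for the real work of each task.
executed = run_pipeline(dependencies, lambda name: None)
```

Ingestion always runs before post-processing, and both downstream tasks run only after post-processing has finished. Production orchestrators add scheduling, retries, and monitoring on top of this ordering.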
Related Episodes


SDS 587: Data Engineering for Data Scientists — with Mark Freeman
SDS 657: How to Learn Data Engineering — with Andreas Kretz (@andreaskayy)
SDS 485: Financial Data Engineering — with Doug Eisenstein
SDS 619: Tools for Deploying Data Models into Production — with Erik Bernhardsson
SDS 615: How to Ace Your Data Science Interview — with Nick Singh
SDS 561: Engineering Data APIs — with Nate Fox
SDS 605: Upskilling in Data Science and Machine Learning — with Kian Katanforoosh
SDS 433: Data Science Trends for 2021 — with Ben Taylor
SDS 487: Fixing Dirty Data — with Susan Walsh
SDS 517: Courses in Data Science and Machine Learning — with Sadie St. Lawrence
SDS 555: Sports Analytics and 66 Days of Data with @KenJee_ds
SDS 499: Data Meshes and Data Reliability — with Barr Moses
SDS 483: Setting Yourself Apart in Data Science Interviews — with Andrew Jones