Published Sep 3, 2019

SE-Radio Episode 272: Frances Perry on Apache Beam

Frances Perry explains how Apache Beam, an open-source programming model, unifies batch and stream processing, addresses the limitations of traditional architectures, and improves the accuracy of results by using watermarks to account for event time skew.
Episode Highlights

  • Stream Processing

    Stream processing handles data continuously as it arrives, producing results in near real time. Frances Perry, a tech lead at Google Cloud Dataflow, explains that stream processing involves processing data as it arrives, in contrast to traditional batch processing, which handles data in large chunks [1]. This approach is crucial for applications that need immediate insight into their data, such as live analytics and monitoring systems. The episode introduces the topic by highlighting its significance in modern software engineering [2].
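
    The contrast between the two styles can be sketched in a few lines of plain Python. This is an illustration only, not Apache Beam code; the function names batch_total and streaming_totals and the sample data are assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def batch_total(events: List[int]) -> int:
    """Batch style: wait for the whole dataset, then compute one result."""
    return sum(events)


def streaming_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming style: emit an updated result as each element arrives."""
    running = 0
    for value in events:
        running += value
        yield running


clicks = [3, 1, 4]  # hypothetical per-second click counts
print(batch_total(clicks))             # a single answer once all data is in
print(list(streaming_totals(clicks)))  # an updated answer after every element
```

    The streaming version never needs the full input at once, which is why it also works on an unbounded source.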

       

    Watermarks

    Watermarks and windowing are essential concepts in stream processing that help manage data timing and accuracy. Watermarks track how completely event data has been processed, allowing systems to handle late-arriving data without unnecessary delays [3]. Perry explains that windowing divides data into chunks based on event time, enabling precise data aggregation and analysis [4]. She notes, "Watermarks let you very carefully track the distinction between event and processing time," which is vital for maintaining data integrity in real-time systems.
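
    A minimal sketch of how event-time windows and a watermark interact, written in plain Python rather than with the Beam SDK. The 60-second window size, the ("event" | "watermark", time) tuple encoding, and the choice to drop late events are all assumptions made for this example; real systems offer richer policies for late data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 60  # seconds per fixed window (assumed value)


def window_start(event_time: int) -> int:
    """Map an event time to the start of its fixed window."""
    return event_time - (event_time % WINDOW_SIZE)


def count_per_window(stream: Iterable[Tuple[str, int]]) -> List[Tuple[int, int]]:
    """Count events per event-time window, emitting a window once the
    watermark passes its end. Returns (window_start, count) pairs."""
    open_windows: Dict[int, int] = defaultdict(int)
    emitted: List[Tuple[int, int]] = []
    watermark = 0
    for kind, t in stream:
        if kind == "event":
            start = window_start(t)
            if start + WINDOW_SIZE <= watermark:
                continue  # late: its window was already closed (dropped here)
            open_windows[start] += 1
        else:
            # Watermark: no more events with event time < t are expected.
            watermark = max(watermark, t)
            for start in sorted(open_windows):
                if start + WINDOW_SIZE <= watermark:
                    emitted.append((start, open_windows.pop(start)))
    return emitted


stream = [
    ("event", 5), ("event", 70), ("event", 30),  # 30 arrives out of order
    ("watermark", 60),                           # window [0, 60) is complete
    ("event", 10),                               # late for window [0, 60)
    ("event", 65),
    ("watermark", 120),                          # window [60, 120) is complete
]
print(count_per_window(stream))
```

    Events are grouped by when they happened, not by when they arrived, and a window's result is held back only until the watermark says it is complete.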

       

    Event Skew

    Event time skew presents challenges in stream processing by causing discrepancies between when data events occur and when they are processed. Perry describes this skew as the difference between event time and processing time, which can lead to inaccuracies in real-time data analysis [5]. To address this, systems often use batch processing to correct results, but this approach can be cumbersome and inefficient [6]. Perry emphasizes the importance of developing systems that can handle late-arriving data without sacrificing accuracy or latency.
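
    The skew described above is simple arithmetic: processing time minus event time. The sketch below makes that concrete in plain Python. The Event class, the sample timestamps, and the 10-second lateness threshold are hypothetical values chosen for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    event_time: int       # when the event happened at the source (seconds)
    processing_time: int  # when the pipeline observed it (seconds)


def skew(event: Event) -> int:
    """Event time skew: how long after it happened the event was processed."""
    return event.processing_time - event.event_time


def late_events(events: List[Event], threshold: int) -> List[Event]:
    """Events whose skew exceeds an assumed lateness threshold."""
    return [e for e in events if skew(e) > threshold]


events = [Event(100, 101), Event(101, 130), Event(102, 103)]
print([skew(e) for e in events])
print(late_events(events, threshold=10))
```

    A system that orders results by processing time alone would misplace the second event, which is the kind of inaccuracy that event-time processing is meant to avoid.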
