Published Feb 28, 2024

SE Radio 605: Yingjun Wu on Streaming Databases

Yingjun Wu of RisingWave Labs delves into the transformative power of streaming databases, exploring architectural differences, dynamic scaling, and schema adaptability for real-time insights. He discusses the balance between cost efficiency and performance, addressing data processing challenges like out-of-order events with innovative tools such as watermarks and parallel data consumption.

Episode Highlights

  • Out-of-Order Events

    Handling out-of-order events is a critical challenge in streaming databases because late data can affect the accuracy of results. Wu explains that streaming databases manage this with a mechanism called a watermark: the system keeps a buffer for data ingested within a specific time frame, so late-arriving data can still be included in the results if it falls within the watermark range 1.

    We use a mechanism called watermark.

    ---

    Events arriving after the watermark period may be discarded or buffered, depending on the implementation, to maintain data consistency 1.
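
    The watermark idea above can be illustrated with a small sketch. This is not RisingWave's implementation; the class name `WatermarkWindowCounter`, the tumbling-window counting, the rule "watermark = largest event time seen minus an allowed lateness", and the choice to set late events aside rather than buffer them are all assumptions made for illustration.

```python
from collections import defaultdict


class WatermarkWindowCounter:
    """Counts events per tumbling window, tolerating bounded lateness.

    Hypothetical sketch: the watermark trails the largest event time seen
    so far by `allowed_lateness`. A window is emitted once the watermark
    passes its end; events for already-closed windows are set aside.
    """

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None
        self.open_windows = defaultdict(int)  # window start -> event count
        self.late_events = []                 # events that missed their window

    @property
    def watermark(self):
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.allowed_lateness

    def ingest(self, event_time):
        """Add one event; return the (window_start, count) pairs it closes."""
        window_start = (event_time // self.window_size) * self.window_size
        if window_start + self.window_size <= self.watermark:
            # The window was already finalized, so this event is too late.
            self.late_events.append(event_time)
            return []

        self.open_windows[window_start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

        # Emit every open window whose end is at or before the watermark.
        closed = []
        for start in sorted(self.open_windows):
            if start + self.window_size <= self.watermark:
                closed.append((start, self.open_windows.pop(start)))
        return closed
```

    With `window_size=10` and `allowed_lateness=5`, an event at time 3 that arrives after an event at time 12 is still counted in the first window, while an event at time 4 that arrives after an event at time 16 is set aside, because the watermark (11) has already passed the end of that window.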

       

    Data Ingestion

    Data ingestion in streaming databases involves various methods and protocols for handling high-frequency data efficiently. Wu highlights Kafka and CDC (Change Data Capture) as common ingestion methods, which let a system consume data directly from messaging queues or databases 2. To manage high data volumes, parallel data consumption distributes ingestion across multiple machines 2.

    For data ingestion, we typically ingest the data from Kafka.

    ---

    Data connectors play a crucial role in maintaining data quality and consistency, enabling seamless data flow from various sources 3.
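
    Parallel data consumption, as described above, can be sketched without a real message queue. The example below is a simulation under stated assumptions: the partitioned log is a plain list of lists, threads stand in for separate machines, and the helper names (`assign_partitions`, `consume`, `parallel_ingest`) are hypothetical rather than part of Kafka or any streaming database.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_partitions(num_partitions, num_workers):
    """Spread partition ids across workers round-robin."""
    assignment = {worker: [] for worker in range(num_workers)}
    for partition in range(num_partitions):
        assignment[partition % num_workers].append(partition)
    return assignment


def consume(partitions, log):
    """Read every record of the given partitions as (partition, offset, value)."""
    records = []
    for partition in partitions:
        for offset, value in enumerate(log[partition]):
            records.append((partition, offset, value))
    return records


def parallel_ingest(log, num_workers):
    """Consume all partitions of `log`, one worker per partition group."""
    assignment = assign_partitions(len(log), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(consume, partitions, log)
            for partitions in assignment.values()
        ]
        # Records stay ordered within a partition; order across
        # partitions depends on how workers are grouped.
        return [record for future in futures for record in future.result()]
```

    Each partition is read by exactly one worker, so ordering is preserved within a partition while total throughput scales with the number of workers.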

       

    Deduplication

    Deduplication is essential in streaming databases to ensure data accuracy and efficiency. Wu describes how streaming databases track data offsets to identify and discard duplicate entries, maintaining a clear view of what has already been processed 4. This approach prevents redundant processing and ensures that only unique data is considered in real-time analytics.

    We actually will track the offset of the data.

    ---

    As streaming databases evolve, they continue to enhance their capabilities, integrating with cloud technologies to offer more powerful and efficient data processing solutions 4.
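
    The offset tracking described in this section can be sketched as follows. This is a minimal illustration, not the actual mechanism of any particular system: it assumes records arrive in offset order within each partition (as with a Kafka partition), and the names `OffsetDeduplicator` and `process` are hypothetical.

```python
class OffsetDeduplicator:
    """Remembers the highest offset processed per partition."""

    def __init__(self):
        self.last_offset = {}  # partition -> highest offset processed

    def accept(self, partition, offset):
        """Return True for a new record, False for a replayed one."""
        last = self.last_offset.get(partition, -1)
        if offset <= last:
            return False  # already processed, e.g. redelivered after a retry
        self.last_offset[partition] = offset
        return True


def process(records, dedup):
    """Keep only the values of records that have not been seen before."""
    return [
        value
        for partition, offset, value in records
        if dedup.accept(partition, offset)
    ]
```

    Storing one offset per partition keeps the state small, at the cost of relying on in-order delivery within a partition; a source without that guarantee would need to track individual record identifiers instead.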

       

    Integrated Challenges

    Managing out-of-order events and ingesting data efficiently are intertwined challenges in streaming databases. Wu explains that watermarks help handle out-of-order events by buffering data within a specific time frame, which keeps results accurate 1. Meanwhile, ingestion methods such as Kafka and CDC move data into the system, supporting high-frequency applications such as stock tracking and manufacturing 2.

    We use a mechanism called watermark.

    ---

    These techniques together enhance the robustness and reliability of streaming databases in processing real-time data.
