Published Feb 28, 2024

SE Radio 605: Yingjun Wu on Streaming Databases

Yingjun Wu of RisingWave Labs discusses streaming databases, exploring architectural differences, dynamic scaling, and schema adaptability for real-time insights. He covers the balance between cost efficiency and performance and addresses data processing challenges such as out-of-order events, using mechanisms such as watermarks and parallel data consumption.

Episode Highlights

Out-of-Order Events

Handling out-of-order events is a critical challenge in streaming databases because it can affect the accuracy of results. Yingjun Wu explains that streaming databases use a mechanism called a watermark to manage this issue. The watermark lets the system keep a buffer for data ingested within a specific time frame, so that late-arriving data can still be included in the results if it falls within the watermark range 1.

We use a mechanism called watermark.

---

Events arriving after the watermark period may be discarded or buffered, depending on the implementation, to maintain data consistency 1.
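The watermark idea described above can be illustrated with a small Python sketch. It is a toy model, not RisingWave's implementation: the class name `WatermarkBuffer`, the `allowed_lateness` parameter, and the choice to drop late events (rather than buffer them elsewhere) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkBuffer:
    """Toy event-time buffer (hypothetical; not a real streaming database API)."""

    allowed_lateness: int   # assumed knob: how far behind the newest event we still accept
    max_event_time: int = 0  # largest event timestamp seen so far
    accepted: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    @property
    def watermark(self) -> int:
        # Events with a timestamp below this value count as too late.
        return self.max_event_time - self.allowed_lateness

    def ingest(self, event_time: int, payload: str) -> bool:
        """Accept the event if it is at or above the watermark, else drop it."""
        if event_time < self.watermark:
            self.dropped.append((event_time, payload))
            return False
        self.accepted.append((event_time, payload))
        self.max_event_time = max(self.max_event_time, event_time)
        return True


buf = WatermarkBuffer(allowed_lateness=5)
buf.ingest(10, "on time")        # accepted; the watermark moves to 5
buf.ingest(8, "slightly late")   # accepted; 8 is still at or above the watermark
buf.ingest(3, "too late")        # dropped; 3 is below the watermark
```

A larger `allowed_lateness` tolerates more disorder at the cost of holding results open for longer, which is the kind of trade-off an implementation has to choose.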



Data Ingestion

Data ingestion in streaming databases involves various methods and protocols to handle high-frequency data efficiently. Wu highlights Kafka and CDC (Change Data Capture) as common ingestion methods, which let a system consume data directly from message queues or databases 2. To manage high data volumes, parallel data consumption is employed, distributing data ingestion across multiple machines 2.

For data ingestion, we typically ingest the data from Kafka.

---

Data connectors play a crucial role in maintaining data quality and consistency, enabling seamless data flow from various sources 3.
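Parallel data consumption, as mentioned in this section, can be sketched in Python as follows. This is only an in-memory illustration: threads stand in for the multiple machines, lists stand in for message-queue partitions, and every function name here is hypothetical rather than part of Kafka or any streaming database.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_into_partitions(events, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def consume_partition(partition):
    # Each worker aggregates only the keys in its own partition.
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals


def consume_in_parallel(events, num_partitions=4):
    partitions = split_into_partitions(events, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(consume_partition, partitions))
    merged = {}
    for partial in partial_results:
        merged.update(partial)  # safe: a key never appears in two partitions
    return merged


trades = [("AAPL", 1), ("MSFT", 2), ("AAPL", 3)]
totals = consume_in_parallel(trades)
```

Partitioning by key means each worker can aggregate its share independently, so the merge step never has to reconcile the same key from two workers.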



Deduplication

Deduplication is essential in streaming databases to ensure data accuracy and efficiency. Wu describes how streaming databases track data offsets to identify and discard duplicate entries, maintaining a clear view of which data has already been processed 4. This approach prevents redundant processing and ensures that only unique data is considered in real-time analytics.

We actually will track the offset of the data.

---

As streaming databases evolve, they continue to enhance their capabilities, integrating with cloud technologies to offer more powerful and efficient data processing solutions 4.
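The offset tracking described in this section can be sketched in Python as below. It is a minimal illustration under one stated assumption: offsets increase monotonically within each partition and are delivered in order within a partition, as Kafka offsets are. The class and method names are hypothetical.

```python
class OffsetDeduplicator:
    """Toy duplicate filter keyed on (partition, offset); names are illustrative."""

    def __init__(self):
        # partition id -> highest offset already processed
        self.last_offset = {}

    def should_process(self, partition: int, offset: int) -> bool:
        """Return True for new records, False for replays of processed offsets."""
        last = self.last_offset.get(partition, -1)
        if offset <= last:
            return False  # already seen: a redelivered or replayed record
        self.last_offset[partition] = offset
        return True


dedup = OffsetDeduplicator()
first = dedup.should_process(partition=0, offset=0)   # new record
replay = dedup.should_process(partition=0, offset=0)  # same offset delivered again
```

Storing only the highest offset per partition keeps the state small, but it relies on the in-order assumption above; a source without ordered offsets would need a different scheme.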



Integrated Challenges

Managing out-of-order events and ingesting data efficiently are intertwined challenges in streaming databases. Wu explains that watermarks help handle out-of-order events by buffering data within a specific time frame, ensuring accurate results 1. Meanwhile, ingestion methods like Kafka and CDC facilitate the flow of data into the system, supporting high-frequency applications such as stock tracking and manufacturing 2.

These techniques together enhance the robustness and reliability of streaming databases in processing real-time data.

Related Episodes

SE Radio 560: Sugu Sougoumarane on Distributed SQL Databases
SE-Radio Episode 346: Stephan Ewen on Streaming Architecture
SE Radio 623: Mike Freedman on TimescaleDB
SE-Radio Episode 243: RethinkDB with Slava Akhmechet
SE Radio 592: Jaxon Repp on Distributed Data Infrastructure
SE Radio 561: Dan DeMers on Dataware
SE Radio 601: Han Yuan on Reorganizations
SE Radio 583: Lukas Fittl on Postgres Performance
SE-Radio Episode 353: Max Neunhoffer on Multi-model databases and ArangoDB
SE Radio 619: James Strong on Kubernetes Networking
364: Peter Zaitsev on Choosing the Right Open Source Database
Episode 417: Alex Petrov on Database Storage Engines
SE-Radio Episode 344: Pat Helland on Web Scale
Episode 194: Michael Hunger on Graph Databases
Episode 504: Frank McSherry on Materialize

Data Processing Challenges

Yingjun Wu discusses the complexities of handling out-of-order events and data ingestion in streaming databases. He highlights the use of watermarks and parallel data consumption to ensure accurate and efficient data processing.