Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Fraud detection is an often-mentioned use case for Apache Flink. Shriram Ravichandran and Dharmateja Yarlagadda wrote two blog posts showing how to implement Flink jobs to detect suspicious transactions based on basic rules (part 1) or complex pattern matching (part 2); a minimal sketch of the rule-based approach follows this list.
- Yaroslav Tkachenko started an article series reflecting on selected challenges in modern data streaming systems. Using Apache Flink as an example, part one elaborates on various efficiency-related aspects, while part two puts the focus on developer experience.
- My recent article explains how to improve the overall testability of Apache Flink jobs. More specifically, it addresses how to (re)write (existing) Flink jobs in a more modular way by defining job-specific interfaces for the actual processing logic while making source and sink components pluggable; a small sketch of that structure also follows below.
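To give a flavor of the rule-based variant from part 1 of the fraud detection posts, here is a minimal sketch; the event shape, threshold, and in-memory source are made up for illustration, while the original posts use Kafka sources and richer rules:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimpleFraudRuleJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (account id, amount) pairs; a real job would consume these from a Kafka source.
        DataStream<Tuple2<String, Double>> transactions = env.fromElements(
                Tuple2.of("acct-1", 42.0),
                Tuple2.of("acct-1", 15_000.0));

        transactions
                .filter(tx -> tx.f1 > 10_000.0) // basic rule: flag unusually large single transactions
                .print();

        env.execute("simple-fraud-rule");
    }
}
```

And here is a rough illustration of the modular structure from the testability article: the processing logic sits behind a small, job-specific interface, and the source and sink are injected from the outside, so a test can plug in in-memory elements and a printing or collecting sink instead of Kafka connectors. All names are illustrative rather than taken from the article:

```java
import java.util.function.Consumer;
import java.util.function.Function;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ModularJobSketch {

    // Job-specific interface for the actual processing logic, free of any I/O concerns.
    interface ProcessingLogic<IN, OUT> extends Function<DataStream<IN>, DataStream<OUT>> {}

    // Wires a pluggable source, the logic, and a pluggable sink together.
    static <IN, OUT> void assemble(DataStream<IN> source,
                                   ProcessingLogic<IN, OUT> logic,
                                   Consumer<DataStream<OUT>> sink) {
        sink.accept(logic.apply(source));
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Production wiring would pass in Kafka source/sink builders here;
        // the sketch uses in-memory elements and print() to stay self-contained.
        ProcessingLogic<String, Integer> logic = stream -> stream.map(String::length).returns(Types.INT);
        assemble(env.fromElements("a", "bb", "ccc"), logic, DataStream::print);

        env.execute("modular-job-sketch");
    }
}
```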
Event Streaming
- KIP-1150: Diskless Topics was proposed recently, with the primary aim of redirecting replication from broker disks to cloud object storage, thereby significantly reducing the total cost of operating Kafka clusters. Filip Yonov and Josep Prat wrote this article discussing several insightful aspects of the idea.
- Related to KIP-1150, Gunnar Morling blogged about “What If We Could Rebuild Kafka From Scratch?”. In this thought experiment, he comes up with his personal nine-item wishlist for what a potential “Kafka.next” should look like.
- KPipe is a new project providing a functional, high-performance Kafka consumer implementation built on modern Java features: it uses virtual threads, composable message processors, and DslJson. A rough sketch of the underlying idea follows below.
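I have not looked at KPipe's actual API in detail, so the following is explicitly not KPipe code. It is just a small sketch of the combination it builds on - one virtual thread per record plus small, composable processor functions - using the plain Kafka consumer client; broker address and topic name are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class VirtualThreadConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "sketch-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Composable processors: plain functions chained with andThen().
        Function<String, String> pipeline =
                ((Function<String, String>) String::trim).andThen(String::toUpperCase);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor()) {
            consumer.subscribe(List.of("orders")); // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    // Each record is processed on its own virtual thread. Note that with
                    // auto-commit, offsets may be committed before processing finishes;
                    // a production consumer would coordinate commits with completion.
                    workers.submit(() -> System.out.println(pipeline.apply(rec.value())));
                }
            }
        }
    }
}
```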
Data Ecosystem
- Ever wondered what makes Apache Arrow a good choice as a data interchange format for databases and query engines? Find out in this two-part article series by Ian Cook, David Lee, and Matt Topol, who discuss “How the Apache Arrow Format Accelerates Query Result Transfer” and “Data Wants to Be Free: Fast Data Exchange with Apache Arrow”. A tiny code example follows after this list.
- With “How I’d Learn Apache Iceberg (if I Had To Start Over)”, Dunith Danushka shares a 7-week study plan for getting started with the extremely popular Apache Iceberg project.
- Chris Riccomini recently wrote an insightful post on incremental view maintenance (IVM), highlighting its growing adoption in platforms like Materialize, PostgreSQL's pg_ivm extension, and tools such as Epsio or Feldera. He discusses key research contributions, including timely and differential dataflow as well as DBSP, all of which have influenced the development of efficient IVM systems. A toy illustration of the incremental idea also follows below.
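To make the Arrow discussion a bit more tangible, here is a tiny, self-contained example (column names and values are invented) that builds a single columnar record batch and writes it to Arrow's IPC stream format, which is essentially what travels over the wire when results are exchanged this way. On recent JDKs, Arrow's Java memory module may additionally require the --add-opens java.base/java.nio=ALL-UNNAMED flag:

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

public class ArrowBatchSketch {
    public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator();
             IntVector id = new IntVector("id", allocator);
             VarCharVector name = new VarCharVector("name", allocator)) {

            // Populate two columns, forming a record batch with two rows.
            id.allocateNew(2);
            id.set(0, 1);
            id.set(1, 2);
            id.setValueCount(2);

            name.allocateNew(2);
            name.setSafe(0, "alice".getBytes(StandardCharsets.UTF_8));
            name.setSafe(1, "bob".getBytes(StandardCharsets.UTF_8));
            name.setValueCount(2);

            // Serialize the batch to the IPC stream format - no per-row conversion involved.
            VectorSchemaRoot root = VectorSchemaRoot.of(id, name);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
                writer.start();
                writer.writeBatch();
                writer.end();
            }
            System.out.println("IPC stream size: " + out.size() + " bytes");
        }
    }
}
```

And for the IVM post, a deliberately toy illustration of the core idea: rather than recomputing an aggregate view from scratch, every insert or delete is applied as a delta to the materialized result. This is purely conceptual and not how Materialize, pg_ivm, Epsio, or Feldera are implemented:

```java
import java.util.HashMap;
import java.util.Map;

// Incrementally maintained equivalent of "SELECT category, count(*) ... GROUP BY category".
public class IncrementalCountView {

    private final Map<String, Long> counts = new HashMap<>();

    public void onInsert(String category) {
        counts.merge(category, 1L, Long::sum);   // +1 delta
    }

    public void onDelete(String category) {
        counts.merge(category, -1L, Long::sum);  // -1 delta (assumes the row was inserted before)
    }

    public Map<String, Long> view() {
        return counts;
    }

    public static void main(String[] args) {
        IncrementalCountView view = new IncrementalCountView();
        view.onInsert("books");
        view.onInsert("books");
        view.onDelete("books");
        view.onInsert("games");
        System.out.println(view.view()); // resulting counts: books=1, games=1
    }
}
```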
Data Platforms and Architecture
- Vu Trinh provides a concise overview of how Meta modernized their lakehouse across the whole organization, with three primary goals: engineering efficiency, faster innovation, and a better user experience. The TL;DR is that they moved away from fragmented query engines and differing SQL dialects towards a unified execution engine and standard SQL.
- SeungMin Lee concluded a blog series with “Iceberg Operation Journey: Takeaways for DB & Server Logs”. This third article shares the chosen configuration settings, experiences with partitioning strategies, and how to monitor Iceberg-related metrics when loading different types of logs into Iceberg tables.
RDBMS and Change Data Capture
- I recently stumbled upon drawDB, a really convenient in-browser editor for database modeling and SQL generation. Check out the GitHub repo.
- Earlier this year, Daniil Roman shared a nice real-world use case explaining how their team used Debezium and Apache Kafka in a recent data migration project.
- Debezium 3.1.0.Final is absolutely packed with good stuff. Most notably, this release ships Debezium Platform (short video walkthrough), which offers a modern and opinionated way to fully describe data pipelines - source/sink configurations and transformation chains - all running on top of Debezium Server and deployable to Kubernetes with Helm charts.
- Following the general industry trend of easing integration with AI-related workloads, Debezium recently shipped improved vector data type support for sources and added JDBC sink support for writing vector embeddings to MySQL and PostgreSQL. In addition, there is a new Debezium Server sink for Milvus and a dedicated vector embeddings transformation.
- The Apache Flink community published a helpful article on Alibaba’s Cloud blog about building Flink CDC pipelines. The first half provides a good overview before the hands-on part explains how to build a data pipeline from MySQL to Paimon. A minimal source-side sketch follows below.
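As a taste of the source side of such a pipeline, here is a minimal DataStream-based sketch that reads MySQL change events as JSON via the MySQL CDC connector and simply prints them. Connection details are placeholders, the article itself builds the full MySQL-to-Paimon pipeline with Flink CDC's declarative YAML definitions, and note that the connector packages moved from com.ververica.cdc to org.apache.flink.cdc in recent Flink CDC releases:

```java
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MySqlCdcSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; adjust to your MySQL instance.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("shop")
                .tableList("shop.orders")
                .username("flink_cdc")
                .password("secret")
                .deserializer(new JsonDebeziumDeserializationSchema()) // emit change events as JSON strings
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000); // the snapshot-to-binlog handoff relies on checkpointing

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mysql-cdc-source")
           .print(); // a real pipeline would write to a Paimon sink instead
        env.execute("mysql-cdc-sketch");
    }
}
```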
Paper of the Month
With “An Empirical Evaluation of Columnar Storage Formats”, Xinyu Zeng et al. provide valuable insights into the design and optimization of columnar storage formats in the context of modern data processing needs. By highlighting the strengths and limitations of Parquet and ORC, the authors offer guidance for developing next-generation storage formats that are better aligned with current hardware capabilities and workload requirements. Their benchmark framework serves as a tool for future evaluations, ensuring that storage formats evolve to meet the demands of diverse and complex data environments.
Events & Call for Papers (CfP)
- DevoxxUK (London, UK) May 7-9
- Current 2025 (London, UK) May 20-21
- Snowflake Summit 2025 (San Francisco, CA, USA) June 2-5
- AI & Big Data Expo (Santa Clara, CA, USA) June 4-5
- Flink Forward 2025 (Barcelona, Spain) Oct 13-16, CfP open
New Releases
- Debezium 3.1.0.Final and 3.1.1.Final
- Apache Iceberg 1.7.2 and 1.9.0
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.