Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- This community article published on Alibaba’s blog is based on a recent Flink Forward Asia 2024 talk. It explains the materialized table feature in Flink 2.0, which lets users define data freshness and query logic declaratively, and which is designed to unify streaming and batch workloads even further (see the first SQL sketch after this list).
- Intense, but worth it: Robin Moffatt wrote about joins and changelogs in Flink SQL. There’s a lot to learn, so you might wanna grab a coffee or two :)
- JOINs in Flink streaming jobs can lead to ever-increasing state size and a few other related challenges. Work on FLIP-486, which proposes a new Delta Join to address this problem, has started; the corresponding ticket lists Flink 2.1.0 as the target version.
- There’s more than just one way to ingest CDC events from Kafka topics into Flink for SQL-based stream processing. Gunnar Morling nicely covers the different options, and discusses when to use which, in one of his latest blog posts (the second sketch below shows one of them).
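To give a flavor of the materialized table feature mentioned in the first item, here is a minimal sketch following the declarative FLIP-435 syntax that the talk covers; table and column names are made up for illustration:

```sql
-- Hypothetical example: declare the desired data freshness together with the
-- query logic; based on the FRESHNESS interval, Flink decides whether to run
-- a continuous streaming job or periodic batch refreshes.
CREATE MATERIALIZED TABLE daily_order_stats
FRESHNESS = INTERVAL '30' MINUTE
AS SELECT
  order_date,
  COUNT(*)    AS order_count,
  SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
```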
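And as a concrete taste of the CDC ingestion options Gunnar discusses, this is roughly what reading Debezium-formatted change events from a Kafka topic as a changelog table looks like in Flink SQL (topic name and broker address are placeholders):

```sql
-- Interpret Debezium JSON envelopes from a Kafka topic as a changelog
-- stream, so that inserts, updates, and deletes are reflected downstream.
CREATE TABLE customers (
  id    INT,
  name  STRING,
  email STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'dbserver1.inventory.customers',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);
```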
Event Streaming
- In "The various tiers of Apache Kafka Tiered Storage" Jakub Scholz explores remote storage options for Kafka by showing choices other than just going with S3 directly. He discusses a Strimzi-based example that uses shared NFS storage and Aiven's tiered storage plugin.
- Speaking of Strimzi: version 0.46 was released earlier this month, and this 6-minute video walks you through its main new features.
- Two interesting KIPs caught my attention:
- KIP-1159 (under discussion) proposes a special serializer implementing reference-based messaging (a.k.a. the claim check pattern) to externalize large payloads into other storage systems rather than writing them into Kafka topics directly; a hypothetical sketch of the pattern follows this list.
- KIP-1182 (draft) suggests the definition and implementation of a vendor-neutral quality of service (QoS) framework which should allow Kafka to serve more diverse workloads.
- Kafka Schema Registry Migrator is a brand-new open-source tool that aims to simplify managing schemas across multiple registry instances. Read more in Roman Melnyk’s inaugural blog post and check out the GitHub repo.
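Picking up the tiered storage item above: here is a minimal sketch of the broker settings involved, assuming Aiven's plugin and its filesystem backend (which is what makes an NFS mount usable as the remote tier); the plugin class names and paths are assumptions and should be double-checked against the plugin docs:

```properties
# Enable tiered storage and plug in a RemoteStorageManager implementation
# (class names below assume Aiven's tiered storage plugin; paths are made up).
remote.log.storage.system.enable=true
remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.filesystem.FileSystemStorage
rsm.config.storage.root=/mnt/nfs/kafka-tiered-storage

# Tiering is then enabled per topic, e.g. via:
#   remote.storage.enable=true
#   local.retention.ms=3600000   (keep only ~1 hour of data on local disks)
```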
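Since KIP-1159 is still under discussion there is no official API yet, but the claim check pattern itself is easy to sketch. The following hypothetical Java example offloads payloads above a threshold to external storage and sends only a reference through Kafka; the BlobStore interface is a made-up stand-in for whatever storage client (S3, NFS, ...) would actually be used:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.common.serialization.Serializer;

// Made-up stand-in for an external storage client, e.g. an S3 wrapper.
interface BlobStore {
    void put(String key, byte[] payload);
}

// Claim-check sketch: small payloads pass through unchanged, large ones are
// externalized and replaced by a reference that a matching deserializer can
// resolve on the consumer side. This is NOT the KIP-1159 API.
public class ClaimCheckSerializer implements Serializer<byte[]> {

    private static final int THRESHOLD_BYTES = 1024 * 1024; // 1 MiB, arbitrary
    private static final String REF_PREFIX = "claim-check://";

    private final BlobStore blobStore;

    public ClaimCheckSerializer(BlobStore blobStore) {
        this.blobStore = blobStore;
    }

    @Override
    public byte[] serialize(String topic, byte[] data) {
        if (data == null || data.length <= THRESHOLD_BYTES) {
            return data; // small enough: write into the Kafka topic directly
        }
        // Externalize the payload and produce only the reference. A real
        // implementation would mark offloaded records unambiguously, e.g.
        // via a record header rather than a string prefix.
        String key = topic + "/" + UUID.randomUUID();
        blobStore.put(key, data);
        return (REF_PREFIX + key).getBytes(StandardCharsets.UTF_8);
    }
}
```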
Data Ecosystem
- Apache Iceberg 1.9.0 came out last month, and Snowflake’s Danica Fine walks you through all the goodness of the new version. Get up to speed in about 7 minutes by watching her release video.
- Amit Gilad recently wrote an insightful article about the pros and cons of the two primary compaction strategies in Apache Iceberg (Sort vs. Binpack) and why it can make sense to adopt a hybrid approach (see the Spark example after this list).
- Lake Loader is a new tool designed to benchmark incremental write loads to data lakes and warehouses according to configurable load patterns. It's built on top of Apache Spark and can ingest into popular open table formats.
- Glauber Costa shares how Turso fully reworked their storage system, allowing them to operate entirely on S3 rather than relying on local disks, and why this is particularly beneficial for BYOC deployments.
- Yaroslav Tkachenko recently announced his solopreneurship project Irontools, which currently offers two commercial Apache Flink extensions: Iron Serde acts as a high-performance drop-in replacement for Kafka SerDes, focusing on Avro and JSON, while Iron Functions enables Flink UDFs written in previously unsupported languages via WebAssembly.
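To make the compaction trade-off from Amit's article concrete: in Spark, both strategies are invoked via Iceberg's rewrite_data_files procedure (catalog, table, and column names below are made up):

```sql
-- Binpack (the default): cheaply coalesce small files into larger ones.
CALL my_catalog.system.rewrite_data_files(table => 'db.events');

-- Sort: rewrite files while ordering rows, which improves clustering and
-- read-time pruning at a higher compaction cost.
CALL my_catalog.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'user_id, event_time'
);
```

A hybrid approach along the lines of the article could, for example, run the cheap binpack rewrite frequently and the sort rewrite only occasionally or on selected partitions.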
Data Platforms and Architecture
- Tristan Culp & Gaurav Sharma from DoorDash explained why and how they are using Flink with Iceberg in their Iceberg Summit 2025 talk. If you prefer reading, Vu Trinh wrote a nice article summarizing this presentation.
- Kinesh Satiya shared a behind-the-scenes perspective on how Netflix is "Building a Robust Ads Event Processing Pipeline". It's an interesting read, providing high-level insights into evolving large-scale systems in the field during multi-quarter efforts across different engineering teams.
RDBMS and Change Data Capture
- Not everyone is intimately familiar with change data capture, which is why Kirill Bobrov wrote this nice intro article to motivate the topic, briefly discuss different alternatives, and explain some benefits of a log-based CDC approach (see the connector config sketch after this list).
- Pravish Sood blogged about how Squarespace used Debezium to tackle a large-scale database migration from PostgreSQL to CockroachDB, with snapshot speeds of up to 60k rows per second.
- In case you missed it, the Debezium team will be transitioning from Red Hat to IBM this July, a move which is primarily about joint innovation in the Java ecosystem. Read more background and details if interested.
- Gwen Shapira recently posted a really helpful summary showing which operations cause a table rewrite in Postgres; a few SQL examples follow below. Additionally, there is a continuously updated and slightly more detailed tabular overview of this by Robins Tharakan.
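As a concrete taste of the log-based approach from Kirill's intro, here is a minimal Debezium Postgres connector configuration for Kafka Connect; hostname, credentials, and table names are placeholders:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "inventory",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.customers"
  }
}
```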
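And to make the table-rewrite topic a bit more tangible, a few examples (behavior as of current Postgres versions; table and column names are made up):

```sql
-- No table rewrite: adding a nullable column without a default.
ALTER TABLE orders ADD COLUMN note text;

-- Also no rewrite since Postgres 11: adding a column with a constant default.
ALTER TABLE orders ADD COLUMN status text DEFAULT 'new';

-- Rewrites the whole table: changing a column's type when values must be
-- converted (here from integer to numeric).
ALTER TABLE orders ALTER COLUMN amount TYPE numeric(12, 2);

-- Rewrites as well: VACUUM FULL (and CLUSTER) reorganize the entire heap.
VACUUM FULL orders;
```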
Paper of the Month
Jacopo Tagliabue et al. wrote "FaaS and Furious: abstractions and differential caching for efficient data pre-processing", a short paper introducing Bauplan, a data lakehouse platform for running queries and declarative pipelines. It comprises an Iceberg-compatible data catalog, a data-aware Function-as-a-Service runtime, and a set of abstractions for DAGs, and it allows users to express data transformations in SQL or Python code.
Events & Call for Papers (CfP)
- Snowflake Summit 2025 (San Francisco, CA, USA) June 2-5
- AI & Big Data Expo (Santa Clara, CA, USA) June 4-5
- Flink Forward 2025 (Barcelona, Spain) October 13-16, CfP open
- Current 2025 (New Orleans, LA, USA) October 29-30, CfP open
New Releases
- Flink CDC 3.4.0
- Apache Kafka 3.9.1
- Strimzi 0.46
- librdkafka 2.10.0
- Debezium 3.2.0.Alpha1
- Kroxylicious 0.12.0
That’s all for this month! We hope you’ve enjoyed the newsletter and we’d love to hear any feedback or suggestions you’ve got.
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)