May 28, 2025
5 min read

Checkpoint Chronicle - May 2025

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

  • This community article published on Alibaba’s blog, based on a recent Flink Forward Asia 2024 talk, explains the materialized table feature in Flink 2.0, which lets users declaratively define data freshness together with the query logic and is designed to unify streaming and batch workloads even further (see the sketch right after this list).
  • JOINs in Flink streaming jobs can lead to ever-increasing state size and a few other related challenges. Work on FLIP-486, which proposes a new Delta Join to address this problem, has started, and the corresponding ticket lists Flink 2.1.0 as the target version.
  • There’s more than just one way to ingest CDC events from Kafka topics into Flink for SQL-based stream processing. Gunnar Morling nicely covers the different options and discusses when to use which in one of his latest blog posts.
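
To give a flavor of the declarative style mentioned in the first item, here is a minimal PyFlink sketch of a materialized table definition following the FLIP-435 syntax. Table and column names are made up, and materialized tables additionally need a catalog implementation that supports them (e.g. Paimon), which is omitted here.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; a catalog that supports materialized
# tables would also have to be registered, omitted for brevity.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare the query logic once and state how fresh the result must be;
# the engine then decides how to keep it up to date (continuous
# streaming job vs. scheduled batch refreshes) instead of the user
# wiring up two separate pipelines.
t_env.execute_sql("""
    CREATE MATERIALIZED TABLE daily_order_stats
    FRESHNESS = INTERVAL '1' MINUTE
    AS SELECT
         order_date,
         COUNT(*)    AS order_cnt,
         SUM(amount) AS revenue
       FROM orders
       GROUP BY order_date
""")
```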

Event Streaming

  • In "The various tiers of Apache Kafka Tiered Storage" Jakub Scholz explores remote storage options for Kafka by showing choices other than just going with S3 directly. He discusses a Strimzi-based example that uses shared NFS storage and Aiven's tiered storage plugin.
  • Two interesting KIPs caught my attention:
    • KIP-1159 (under discussion) is about introducing a special serializer implementing reference-based messaging (a.k.a. the claim check pattern) to externalize large payloads into other storage systems rather than writing them into Kafka topics directly; the pattern is sketched after this list.
    • KIP-1182 (draft) suggests the definition and implementation of a vendor-neutral quality of service (QoS) framework which should allow Kafka to serve more diverse workloads.
  • Kafka Schema Registry Migrator is a brand-new open-source tool which aims to simplify managing schemas across multiple registry instances. Read more in Roman Melnyk’s inaugural blog post and check out the GitHub repo.
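
Since KIP-1159 is still under discussion there is no official serializer to show yet, but the claim check pattern itself is easy to sketch. Below is a hypothetical producer-side helper in Python, using boto3 and confluent-kafka; the bucket name, topic, and size threshold are all made-up assumptions.

```python
import json
import uuid

import boto3
from confluent_kafka import Producer

THRESHOLD_BYTES = 1024 * 1024   # assumed 1 MiB cut-off for "large" payloads
BUCKET = "large-payloads"       # assumed S3 bucket for externalized payloads

s3 = boto3.client("s3")
producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_with_claim_check(topic: str, payload: bytes) -> None:
    """Write small payloads to Kafka directly; externalize large ones."""
    if len(payload) <= THRESHOLD_BYTES:
        producer.produce(topic, value=payload)
    else:
        # Store the payload in object storage and send only a reference
        # (the "claim check") through the Kafka topic.
        key = f"{topic}/{uuid.uuid4()}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
        reference = {"claim_check": f"s3://{BUCKET}/{key}"}
        producer.produce(topic, value=json.dumps(reference).encode("utf-8"))
    producer.flush()
```

A consumer would do the inverse: detect the reference, fetch the payload from the bucket, and only then deserialize it.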

Data Ecosystem

  • Apache Iceberg 1.9.0 came out last month and Snowflake’s Danica Fine walks you through all the goodness of the new version. Get up to speed in about 7 min by watching her release video.
  • Amit Gilad recently wrote an insightful article about the pros and cons of the two primary compaction strategies (Sort vs. Binpack) in Apache Iceberg and why it can make sense to adopt a hybrid approach; see the sketch after this list.
  • Lake Loader is a new tool designed to benchmark incremental write loads to data lakes and warehouses according to configurable load patterns. It's built on top of Apache Spark and can ingest into popular open table formats.
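
As a concrete reference point for the compaction discussion, both strategies are exposed through Iceberg's rewrite_data_files Spark procedure. Here's a minimal PySpark sketch; the catalog, table, and column names are assumptions.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session that is already configured with an Iceberg
# catalog named `demo`; table and column names below are made up.
spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

# Binpack (the default strategy) simply coalesces small files: cheap,
# but it does nothing about how rows are clustered within files.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")

# The sort strategy additionally clusters rows while rewriting, which
# costs more but can greatly improve file pruning for selective reads.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table      => 'db.events',
        strategy   => 'sort',
        sort_order => 'event_time DESC')
""")
```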

Data Platforms and Architecture

  • Kinesh Satiya shared a behind-the-scenes perspective on how Netflix is "Building a Robust Ads Event Processing Pipeline". It's an interesting read, providing high-level insights into evolving large-scale systems in the field during multi-quarter efforts across different engineering teams.

RDBMS and Change Data Capture

  • Not everyone is intimately familiar with change data capture, which is why Kirill Bobrov wrote this nice intro article to motivate the topic, briefly discuss the alternatives, and explain some benefits of a log-based CDC approach; a consumer-side sketch follows after this list.
  • Pravish Sood blogged about how Squarespace used Debezium to tackle a large-scale database migration from PostgreSQL to CockroachDB, with snapshot speeds of up to 60k rows per second.
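
To make the log-based approach from Kirill's article concrete: a connector such as Debezium turns every committed row change into an event carrying the old and new row images plus an operation code. Here's a minimal consumer-side sketch; the topic name and connection settings are assumptions, and the envelope layout assumes Debezium's default JSON converter with schemas enabled.

```python
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["pg.public.customers"])  # assumed Debezium topic name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.value() is None:
        continue  # nothing new, a transport error, or a delete tombstone
    change = json.loads(msg.value())["payload"]
    # Debezium encodes the operation type: c = create, u = update,
    # d = delete, r = snapshot read; before/after hold the row images.
    print(change["op"], change["before"], "->", change["after"])
```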

Paper of the Month

Jacopo Tagliabue et al. wrote "FaaS and Furious: abstractions and differential caching for efficient data pre-processing", a short paper introducing Bauplan, a data lakehouse platform for running queries and declarative pipelines. It comprises an Iceberg-compatible data catalog, a data-aware Function-as-a-Service runtime, and a set of abstractions for DAGs, and it lets users express data transformations in SQL or Python code. The sketch below gives a rough idea of what such a declarative, function-based DAG style can look like.
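
To illustrate that style, here is a purely hypothetical Python sketch, explicitly not Bauplan's actual SDK: each decorated function is a DAG node whose parameter names refer to upstream nodes, so a platform can infer the dependency graph from the code and decide per node what to run, cache, or skip.

```python
import inspect

import pandas as pd

_REGISTRY: dict = {}  # hypothetical node registry, not a real library

def model(fn):
    """Register a transformation function as a named node in the DAG."""
    _REGISTRY[fn.__name__] = fn
    return fn

@model
def clean_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    return raw_orders.dropna(subset=["order_id"])

@model
def daily_revenue(clean_orders: pd.DataFrame) -> pd.DataFrame:
    return clean_orders.groupby("order_date", as_index=False)["amount"].sum()

def run(node: str, sources: dict) -> pd.DataFrame:
    """Resolve a node's inputs by parameter name, recursing up the DAG."""
    if node in sources:  # an external source table ends the recursion
        return sources[node]
    fn = _REGISTRY[node]
    inputs = {p: run(p, sources) for p in inspect.signature(fn).parameters}
    return fn(**inputs)

raw = pd.DataFrame({"order_id": [1, 2, None],
                    "order_date": ["2025-05-01"] * 3,
                    "amount": [10.0, 5.0, 99.0]})
print(run("daily_revenue", {"raw_orders": raw}))
```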

Events & Call for Papers (CfP)

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.

Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)


📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your inbox to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

Hans-Peter Grahsl

Hans-Peter Grahsl is a Staff Developer Advocate at Decodable. He is an open-source community enthusiast, particularly passionate about event-driven architectures, distributed stream processing systems, and data engineering. For his code contributions, conference talks, and blog posts at the intersection of the Apache Kafka and MongoDB communities, Hans-Peter has received multiple community awards. He likes to code and is a regular speaker at developer conferences around the world.
