July 28, 2025 · 3 min read

Checkpoint Chronicle - July 2025

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.

Stream Processing, Streaming SQL, and Streaming Databases

  • Simon Aubury is known for putting together really entertaining articles about data processing. This time, he explores how to use Apache Flink to identify interesting aviation moments like go-arounds and twin landings using real-time aviation data. The project code is here.
  • Calvin Tran and Shi Kai Ng share their stream processing journey with FlinkSQL. They explain why their existing Zeppelin notebook-based solution no longer cut it and how they migrated to a shared FlinkSQL gateway cluster, focusing on features that promote data democratisation.

Event Streaming

  • In “Why don't Kafka and Iceberg get along?” Filip Yonov discusses today’s unsatisfactory situation when trying to get queryable Iceberg tables out of Kafka topic data. Using Diskless Kafka topics as an example, he argues that open-source Kafka “must absorb Iceberg the way it absorbed tiered storage”.

Data Ecosystem

  • Fluss passed the vote on June 5th to officially become an ASF incubator project. Read Jark Wu’s post about this significant milestone for Apache Fluss and learn what this means and how to join the project’s open-source community. 
  • Apache Polaris - a REST-based catalog service built to serve the needs of modern, open data lakehouses - recently released version 1.0, shipping several new features and improvements together with some promising experimental features worth exploring (a short client sketch follows this list).
  • In “You Don’t Need a Data Owner. Until you do.” Stéphane Derosiaux breaks down what data ownership essentially means, what a data owner actually does, when you need one, and how to address ownership responsibilities without an explicit role or job title.
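
To give a feel for what a REST-based Iceberg catalog looks like from a client’s point of view, here is a small, hedged sketch of connecting to a Polaris endpoint with PyIceberg. The endpoint URL, credentials, catalog, and table names are all hypothetical and not taken from the release notes; check the Polaris 1.0 documentation for the exact connection properties of your deployment.

```python
# Hypothetical sketch: reading an Iceberg table via a Polaris REST catalog with PyIceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/api/catalog",  # hypothetical Polaris endpoint
        "credential": "client-id:client-secret",     # hypothetical OAuth2 client credentials
        "scope": "PRINCIPAL_ROLE:ALL",
        "warehouse": "demo_catalog",                  # hypothetical catalog name in Polaris
    },
)

# Browse namespaces and load an Iceberg table through the REST catalog.
print(catalog.list_namespaces())
table = catalog.load_table("analytics.events")        # hypothetical namespace.table
print(table.schema())
```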

Data Platforms and Architecture

  • Rajiv Shringi et al. provide insights into the architecture and design behind Netflix's TimeSeries Data Abstraction Layer which integrates with storage backends like Apache Cassandra and Elasticsearch. The article addresses typical challenges such as high throughput needs, efficient querying of large datasets, global read and write operations, tunable configurations, handling bursty traffic, and cost efficiency.
  • Yang Guo published a hands-on article to demonstrate how Apache Fluss enables real-time and historical data unification in a lakehouse architecture. The introductory tutorial walks you through a complete local setup of Flink, Paimon, MinIO, and Fluss and allows you to experience Fluss’s capability to combine real-time streaming ingestion with analytical lakehouse querying first hand.
  • Amaresh Bingumalla et al. from Peloton wrote about how they revamped their data platform by adopting Apache Hudi. They replaced daily PostgreSQL snapshots with CDC ingestion using Debezium and Kafka. Initially using copy-on-write, they soon migrated to a merge-on-read strategy with asynchronous table services for compaction and cleaning, which let them overcome snapshot delays, rigid coupling, and high costs (see the sketch after this list).
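
To make the copy-on-write versus merge-on-read distinction from the Peloton write-up a bit more concrete, here is a minimal PySpark sketch of a Hudi merge-on-read upsert. Table, field, and path names are hypothetical, and the options shown are generic Hudi Spark datasource settings rather than Peloton’s actual configuration.

```python
# Minimal sketch: upserting CDC rows into a Hudi merge-on-read table with PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-mor-sketch")
    # Assumes the Hudi Spark bundle is on the classpath, e.g. via --packages.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Pretend this DataFrame holds change events ingested from Debezium via Kafka.
changes = spark.read.json("/tmp/cdc-sketch/workouts")  # hypothetical input path

hudi_options = {
    "hoodie.table.name": "workouts",                          # hypothetical table
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # Merge-on-read keeps writes cheap; compaction runs as a separate table service.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "false",
}

(changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/workouts"))                              # hypothetical table path
```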

Databases and Change Data Capture

  • If you’ve ever wondered how ClickHouse handles data mutations differently from PostgreSQL, this video tutorial walkthrough by Mark Needham is for you. You’ll learn how ClickHouse uses versioned inserts with the ReplacingMergeTree engine rather than directly performing UPDATEs and DELETEs (see the first sketch after this list).
  • Fiore Mario Vitale put together this practical step-by-step guide showing how lineage metadata can significantly ease root cause analysis in end-to-end CDC pipelines. The example is based on Debezium to capture PostgreSQL database changes, Apache Flink for stream processing, and OpenLineage with Marquez for lineage tracking and visualization.
  • Gunnar Morling recently wrote a comprehensive guide on mastering PostgreSQL replication slots for Change Data Capture (CDC) pipelines. He shares several best practices including but not limited to: using the pgoutput logical decoding output plug-in, defining a maximum replication slot size, enabling heartbeats, using table-level publications, and enabling failover slots (see the second sketch after this list).
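
For the ClickHouse item above, here is a minimal sketch of the versioned-insert pattern with ReplacingMergeTree, issued through the clickhouse-connect Python client. The table, columns, and connection details are hypothetical and only meant to illustrate the idea.

```python
# Minimal sketch: modelling updates/deletes as versioned inserts in ClickHouse.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # hypothetical local server

# Instead of UPDATE/DELETE in place, every change is a new row; during merges the
# engine keeps only the row with the highest `version` per ORDER BY key.
client.command("""
    CREATE TABLE IF NOT EXISTS users (
        id UInt64,
        name String,
        version UInt64,
        deleted UInt8 DEFAULT 0
    )
    ENGINE = ReplacingMergeTree(version)
    ORDER BY id
""")

client.command("INSERT INTO users VALUES (1, 'Alice', 1, 0)")
# An "update" is just another insert with a higher version ...
client.command("INSERT INTO users VALUES (1, 'Alice Smith', 2, 0)")
# ... and a "delete" can be modelled as a tombstone row.
client.command("INSERT INTO users VALUES (1, 'Alice Smith', 3, 1)")

# FINAL forces de-duplication at query time instead of waiting for background merges.
rows = client.query("SELECT * FROM users FINAL").result_rows
print(rows)  # a single row: the highest-version entry for id=1 (the tombstone)
```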
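And for the replication-slot guide, a second sketch that applies a few of the listed practices as plain SQL via psycopg2. Slot, publication, and table names are hypothetical, the heartbeat setting lives on the Debezium connector rather than in PostgreSQL, and the failover argument assumes PostgreSQL 17 or later.

```python
# Minimal sketch: a few replication-slot best practices expressed as SQL.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=postgres host=localhost")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

# Table-level publication instead of FOR ALL TABLES, so only captured tables are included.
cur.execute("CREATE PUBLICATION cdc_pub FOR TABLE public.orders, public.customers")

# Cap how much WAL a lagging slot may retain.
cur.execute("ALTER SYSTEM SET max_slot_wal_keep_size = '10GB'")
cur.execute("SELECT pg_reload_conf()")

# Create the slot with the pgoutput plug-in; the final argument enables failover slots
# (available from PostgreSQL 17 onwards).
cur.execute(
    "SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput', false, false, true)"
)

# Heartbeats are configured on the Debezium connector side (e.g. heartbeat.interval.ms),
# so the slot keeps advancing even on low-traffic databases.
```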

Paper of the Month

The paper "What’s the Difference? Incremental Processing with Change Queries in Snowflake" by Tyler Akidau et al. presents Snowflake’s approach to querying and consuming incremental changes in database tables via CHANGES queries and STREAM objects. These primitives address the gap in SQL’s ability to natively express table-to-stream conversions, which are foundational for efficient stream processing. The paper outlines the semantics of change capture, including support for append-only and minimum-delta change formats, and elaborates on the use of CHANGES queries in various contexts like views and complex joins.

Events & Call for Papers (CfP)

New Releases

That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.

Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

Hans-Peter Grahsl

Hans-Peter Grahsl is a Staff Developer Advocate at Decodable. He is an open-source community enthusiast and in particular passionate about event-driven architectures, distributed stream processing systems and data engineering. For his code contributions, conference talks and blog post writing at the intersection of the Apache Kafka and MongoDB communities, Hans-Peter received multiple community awards. He likes to code and is a regular speaker at developer conferences around the world.
