Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Simon Aubury is known for putting together really entertaining articles about data processing. This time he explores how to use Apache Flink to identify interesting aviation moments like go-arounds and twin landings using real-time aviation data. Project code is here.
- André Santos open-sourced an Apache Flink HTTP Full Cache Connector to efficiently enrich streams with infrequently changing reference data fetched via web APIs. Read the article behind this effort.
- Calvin Tran and Shi Kai Ng share their stream processing journey with FlinkSQL. They explain why their existing Zeppelin notebook-based solution didn’t really cut it any longer and how they migrated to a shared FlinkSQL gateway cluster while focusing only on features that promote data democratisation.
Event Streaming
- Alexis Souquiere wrote about the experiences of Michelin’s platform team with using Kestra - a declarative workflow orchestration platform - to help teams migrate their Kafka Connect workloads.
- In “Why don't Kafka and Iceberg get along?” Filip Yonov discusses today’s unsatisfactory situation when trying to get queryable Iceberg tables from Kafka topic data. Using Diskless Kafka topics as an example, he argues that open-source Kafka “must absorb Iceberg the way it absorbed tiered storage”.
Data Ecosystem
- Fluss passed the vote on June 5th to officially become an ASF incubator project. Read Jark Wu’s post about this significant milestone for Apache Fluss and learn what this means and how to join the project’s open-source community.
- Apache Polaris - a REST-based catalog service built to serve the needs of modern, open data lakehouses - recently released version 1.0 shipping several new features and improvements together with some promising experimental features worth exploring.
- Dash Desai published an article showing how to embrace open lakehouse architectures by moving from Delta Lake to Apache Iceberg using Snowflake’s Delta Direct as a simplified approach.
- In “You Don’t Need a Data Owner. Until you do.” Stéphane Derosiaux breaks down what data ownership essentially means, what a data owner actually does, when you need one, and how to address ownership responsibilities without an explicit role or job title.
Data Platforms and Architecture
- Rajiv Shringi et al. provide insights into the architecture and design behind Netflix's TimeSeries Data Abstraction Layer which integrates with storage backends like Apache Cassandra and Elasticsearch. The article addresses typical challenges such as high throughput needs, efficient querying of large datasets, global read and write operations, tunable configurations, handling bursty traffic, and cost efficiency.
- Yang Guo published a hands-on article to demonstrate how Apache Fluss enables real-time and historical data unification in a lakehouse architecture. The introductory tutorial walks you through a complete local setup of Flink, Paimon, MinIO, and Fluss and lets you experience firsthand how Fluss combines real-time streaming ingestion with analytical lakehouse querying.
- Amaresh Bingumalla et al. from Peloton wrote about how they revamped their data platform by adopting Apache Hudi. They replaced daily PostgreSQL snapshots with CDC ingestion using Debezium and Kafka. Starting with copy-on-write tables, they soon migrated to a merge-on-read strategy with asynchronous table services for compaction and cleaning, which helped them overcome snapshot delays, rigid coupling, and high costs.
Databases and Change Data Capture
- Chris Cranford dives into an infamous error - Online REDO LOG files or archive log files do not contain the offset SCN - you might face when working with the Debezium Oracle connector. Besides explaining the common causes, Chris shares several useful mitigation techniques for coping with it in practice.
- In case you ever wondered how ClickHouse handles data mutation operations differently from PostgreSQL, this video tutorial walkthrough by Mark Needham is for you. You’ll learn how ClickHouse uses versioned inserts with the ReplacingMergeTree engine rather than directly performing UPDATEs and DELETEs.
- Fiore Mario Vitale put together this practical step-by-step guide showing how lineage metadata can significantly ease root cause analysis in end-to-end CDC pipelines. The example is based on Debezium to capture PostgreSQL database changes, Apache Flink for stream processing, and OpenLineage with Marquez for lineage tracking and visualization.
- Gunnar Morling recently wrote a comprehensive guide on mastering PostgreSQL replication slots for Change Data Capture (CDC) pipelines. He shares several best practices including but not limited to: using the pgoutput logical decoding output plug-in, defining a maximum replication slot size, enabling heartbeats, using table-level publications, and enabling fail-over slots.
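To make the "versioned inserts" idea from the ClickHouse item above a bit more concrete, here is a toy Python model. It is only a sketch of the general technique, not ClickHouse code: the `Row` class, the `is_deleted` flag, and the `merge` function are hypothetical names chosen for illustration. Updates and deletes are written as new rows with a higher version, and a later merge step keeps the newest row per key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: int
    value: str
    version: int
    is_deleted: bool = False  # hypothetical tombstone flag for "deletes"


def merge(rows):
    """Keep the highest-version row per key and drop tombstoned keys."""
    latest = {}
    for row in rows:
        current = latest.get(row.key)
        if current is None or row.version >= current.version:
            latest[row.key] = row
    live = (r for r in latest.values() if not r.is_deleted)
    return sorted(live, key=lambda r: r.key)


# Every change is an append; nothing is modified in place.
table = []
table.append(Row(1, "alice", 1))
table.append(Row(2, "bob", 1))
table.append(Row(1, "alicia", 2))                # "update" of key 1
table.append(Row(2, "bob", 2, is_deleted=True))  # "delete" of key 2

result = merge(table)
print(result)
```

The point of the design is that writes stay cheap appends, while the cost of reconciling versions is deferred to a merge step that can run later.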
Paper of the Month
The paper "What’s the Difference? Incremental Processing with Change Queries in Snowflake" by Tyler Akidau et al. presents Snowflake’s approach to querying and consuming incremental changes in database tables via CHANGES queries and STREAM objects. These primitives address the gap in SQL’s ability to natively express table-to-stream conversions, which are foundational for efficient stream processing. The paper outlines the semantics of change capture, including support for append-only and minimum-delta change formats, and elaborates on the use of CHANGES queries in various contexts like views and complex joins.
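The difference between the two change formats mentioned above can be sketched with a small Python toy. This is an illustration under simplifying assumptions, not Snowflake's implementation: the helper names (`apply_log`, `min_delta`, `append_only`) and the tuple-based row and log representation are hypothetical. The append-only view reports every inserted row, while the minimum-delta view reports only the net difference between two table versions, so an insert that is later deleted cancels out.

```python
from collections import Counter


def apply_log(rows, log):
    """Replay a change log on top of a list of rows."""
    rows = list(rows)
    for action, row in log:
        if action == "INSERT":
            rows.append(row)
        else:  # "DELETE"
            rows.remove(row)
    return rows


def min_delta(before, after):
    """Net row-level difference between two table versions."""
    b, a = Counter(before), Counter(after)
    inserts = sorted((a - b).elements())
    deletes = sorted((b - a).elements())
    return [("INSERT", r) for r in inserts] + [("DELETE", r) for r in deletes]


def append_only(log):
    """Only the inserted rows from the log; deletes are ignored."""
    return [("INSERT", row) for action, row in log if action == "INSERT"]


v0 = [("a", 1)]
log = [
    ("INSERT", ("b", 2)),
    ("INSERT", ("c", 3)),
    ("DELETE", ("c", 3)),   # cancels the insert of ("c", 3)
    ("DELETE", ("a", 1)),   # an "update" of a, modelled as delete + insert
    ("INSERT", ("a", 10)),
]
v1 = apply_log(v0, log)

delta = min_delta(v0, v1)
appended = append_only(log)
print(delta)
print(appended)
```

Here the row `("c", 3)` shows up in the append-only result but not in the minimum delta, because it never exists in either of the two table versions being compared.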
Events & Call for Papers (CfP)
- JavaZone 2025 (Lillestrøm, Norway) September 4-5
- BigDataLdn 2025 (London, United Kingdom) September 24-25
- Data Streaming Summit (San Francisco, CA, USA) September 29-30
- Devoxx (Antwerp, Belgium) October 6-10
- Flink Forward 2025 (Barcelona, Spain) October 13-16
- Current 2025 (New Orleans, LA, USA) October 29-30
- MQ Summit (Berlin, Germany) November 6
New Releases
- Apache Flink 1.20.2 and 1.19.3
- Apache Flink Kubernetes Operator patch release 1.12.1
- Debezium 3.2.0.Final
- Apache Iceberg 1.9.2
- Apache Polaris 1.0.0
- Strimzi 0.47.0
—
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)