Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- This community article published on Alibaba’s blog is based on a recent Flink Forward Asia 2024 talk. It explains the materialized table feature in Flink 2.0, which lets users define data freshness and query logic declaratively, and which is designed to unify streaming and batch workloads even further (see the first SQL sketch after this list).
- Intense, but worth it: Robin Moffatt wrote about joins and changelogs in Flink SQL. There’s a lot to learn, so you might wanna grab a coffee or two :)
- JOINs in Flink streaming jobs can lead to ever-increasing state size and a few other related challenges. Work on FLIP-486, which proposes a new Delta Join to address this problem, has started; the corresponding ticket lists Flink 2.1.0 as the target version.
- There’s more than just one way to ingest CDC events from Kafka topics into Flink for SQL-based stream processing. Gunnar Morling nicely covers the different options, and discusses when to use which, in one of his latest blog posts (the second sketch below shows one of them).
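To give a flavor of the materialized table feature mentioned in the first item, here is a minimal sketch following the declarative FLIP-435 syntax that the talk covers; table and column names are made up for illustration:

```sql
-- Hypothetical example: declare the desired data freshness together with the
-- query logic; based on the FRESHNESS interval, Flink decides whether to run
-- a continuous streaming job or periodic batch refreshes.
CREATE MATERIALIZED TABLE daily_order_stats
FRESHNESS = INTERVAL '30' MINUTE
AS SELECT
  order_date,
  COUNT(*)    AS order_count,
  SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
```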
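And as a concrete taste of the CDC ingestion options Gunnar discusses, this is roughly what reading Debezium-formatted change events from a Kafka topic as a changelog table looks like in Flink SQL (topic name and broker address are placeholders):

```sql
-- Interpret Debezium JSON envelopes from a Kafka topic as a changelog
-- stream, so that inserts, updates, and deletes are reflected downstream.
CREATE TABLE customers (
  id    INT,
  name  STRING,
  email STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'dbserver1.inventory.customers',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);
```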
Event Streaming
- In "The various tiers of Apache Kafka Tiered Storage" Jakub Scholz explores remote storage options for Kafka by showing choices other than just going with S3 directly. He discusses a Strimzi-based example that uses shared NFS storage and Aiven's tiered storage plugin.
- Speaking of Strimzi: version 0.46 was released earlier this month, and this 6-minute video walks you through its main new features.
- Two interesting KIPs caught my attention:
- KIP-1159 (under discussion) proposes a special serializer implementing reference-based messaging (a.k.a. the claim check pattern) to externalize large payloads into other storage systems rather than writing them into Kafka topics directly; a hypothetical sketch of the pattern follows this list.
- KIP-1182 (draft) suggests the definition and implementation of a vendor-neutral quality of service (QoS) framework which should allow Kafka to serve more diverse workloads.
- Kafka Schema Registry Migrator is a brand-new open-source tool that aims to simplify managing schemas across multiple registry instances. Read more in Roman Melnyk’s inaugural blog post and check out the GitHub repo.
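Picking up the tiered storage item above: here is a minimal sketch of the broker settings involved, assuming Aiven's plugin and its filesystem backend (which is what makes an NFS mount usable as the remote tier); the plugin class names and paths are assumptions and should be double-checked against the plugin docs:

```properties
# Enable tiered storage and plug in a RemoteStorageManager implementation
# (class names below assume Aiven's tiered storage plugin; paths are made up).
remote.log.storage.system.enable=true
remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.filesystem.FileSystemStorage
rsm.config.storage.root=/mnt/nfs/kafka-tiered-storage

# Tiering is then enabled per topic, e.g. via:
#   remote.storage.enable=true
#   local.retention.ms=3600000   (keep only ~1 hour of data on local disks)
```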
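Since KIP-1159 is still under discussion there is no official API yet, but the claim check pattern itself is easy to sketch. The following hypothetical Java example offloads payloads above a threshold to external storage and sends only a reference through Kafka; the BlobStore interface is a made-up stand-in for whatever storage client (S3, NFS, ...) would actually be used:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.common.serialization.Serializer;

// Made-up stand-in for an external storage client, e.g. an S3 wrapper.
interface BlobStore {
    void put(String key, byte[] payload);
}

// Claim-check sketch: small payloads pass through unchanged, large ones are
// externalized and replaced by a reference that a matching deserializer can
// resolve on the consumer side. This is NOT the KIP-1159 API.
public class ClaimCheckSerializer implements Serializer<byte[]> {

    private static final int THRESHOLD_BYTES = 1024 * 1024; // 1 MiB, arbitrary
    private static final String REF_PREFIX = "claim-check://";

    private final BlobStore blobStore;

    public ClaimCheckSerializer(BlobStore blobStore) {
        this.blobStore = blobStore;
    }

    @Override
    public byte[] serialize(String topic, byte[] data) {
        if (data == null || data.length <= THRESHOLD_BYTES) {
            return data; // small enough: write into the Kafka topic directly
        }
        // Externalize the payload and produce only the reference. A real
        // implementation would mark offloaded records unambiguously, e.g.
        // via a record header rather than a string prefix.
        String key = topic + "/" + UUID.randomUUID();
        blobStore.put(key, data);
        return (REF_PREFIX + key).getBytes(StandardCharsets.UTF_8);
    }
}
```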
Data Ecosystem
- Apache Iceberg 1.9.0 came out last month, and Snowflake’s Danica Fine walks you through all the goodness of the new version. Get up to speed in about 7 minutes by watching her release video.
- Amit Gilad recently wrote an insightful article about the pros and cons of the two primary compaction strategies in Apache Iceberg (Sort vs. Binpack) and why it can make sense to adopt a hybrid approach (see the Spark example after this list).
- Lake Loader is a new tool designed to benchmark incremental write loads to data lakes and warehouses according to configurable load patterns. It's built on top of Apache Spark and can ingest into popular open table formats.
- Glauber Costa shares how Turso fully reworked their storage system, allowing them to operate entirely on S3 rather than relying on local disks, and why this is particularly beneficial for BYOC deployments.
- Yaroslav Tkachenko recently announced his solopreneurship project Irontools, which currently offers two commercial Apache Flink extensions: Iron Serde acts as a high-performance drop-in replacement for Kafka SerDes, focusing on Avro and JSON, while Iron Functions enables Flink UDFs written in previously unsupported languages via WebAssembly.
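To make the compaction trade-off from Amit's article concrete: in Spark, both strategies are invoked via Iceberg's rewrite_data_files procedure (catalog, table, and column names below are made up):

```sql
-- Binpack (the default): cheaply coalesce small files into larger ones.
CALL my_catalog.system.rewrite_data_files(table => 'db.events');

-- Sort: rewrite files while ordering rows, which improves clustering and
-- read-time pruning at a higher compaction cost.
CALL my_catalog.system.rewrite_data_files(
  table      => 'db.events',
  strategy   => 'sort',
  sort_order => 'user_id, event_time'
);
```

A hybrid approach along the lines of the article could, for example, run the cheap binpack rewrite frequently and the sort rewrite only occasionally or on selected partitions.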
Data Platforms and Architecture
- Tristan Culp & Gaurav Sharma from DoorDash explained why and how they are using Flink with Iceberg in their Iceberg Summit 2025 talk. If you prefer reading, Vu Trinh wrote a nice article summarizing this presentation.
- Kinesh Satiya shared a behind-the-scenes perspective on how Netflix is "Building a Robust Ads Event Processing Pipeline". It's an interesting read, providing high-level insights into evolving large-scale systems in the field during multi-quarter efforts across different engineering teams.
RDBMS and Change Data Capture
- Not everyone is intimately familiar with change data capture, which is why Kirill Bobrov wrote this nice intro article to motivate the topic, briefly discuss different alternatives, and explain some benefits of a log-based CDC approach (see the connector config sketch after this list).
- Pravish Sood blogged about how Squarespace used Debezium to tackle a large-scale database migration from PostgreSQL to CockroachDB, with snapshot speeds of up to 60k rows per second.
- In case you missed it, the Debezium team will be transitioning from Red Hat to IBM this July, a move which is primarily about joint innovation in the Java ecosystem. Read more background and details if interested.
- Gwen Shapira recently posted a really helpful summary showing which operations cause a table rewrite in Postgres; a few SQL examples follow below. Additionally, there is a continuously updated and slightly more detailed tabular overview of this by Robins Tharakan.
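As a concrete taste of the log-based approach from Kirill's intro, here is a minimal Debezium Postgres connector configuration for Kafka Connect; hostname, credentials, and table names are placeholders:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "inventory",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.customers"
  }
}
```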
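And to make the table-rewrite topic a bit more tangible, a few examples (behavior as of current Postgres versions; table and column names are made up):

```sql
-- No table rewrite: adding a nullable column without a default.
ALTER TABLE orders ADD COLUMN note text;

-- Also no rewrite since Postgres 11: adding a column with a constant default.
ALTER TABLE orders ADD COLUMN status text DEFAULT 'new';

-- Rewrites the whole table: changing a column's type when values must be
-- converted (here from integer to numeric).
ALTER TABLE orders ALTER COLUMN amount TYPE numeric(12, 2);

-- Rewrites as well: VACUUM FULL (and CLUSTER) reorganize the entire heap.
VACUUM FULL orders;
```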
Paper of the Month
Jacopo Tagliabue et al. wrote "FaaS and Furious: abstractions and differential caching for efficient data pre-processing", a short paper introducing Bauplan, a data lakehouse platform for running queries and declarative pipelines. It comprises an Iceberg-compatible data catalog, a data-aware Function-as-a-Service runtime, and a set of abstractions for DAGs, and it allows users to express data transformations in SQL or Python code.
Events & Call for Papers (CfP)
- Snowflake Summit 2025 (San Francisco, CA, USA) June 2-5
- AI & Big Data Expo (Santa Clara, CA, USA) June 4-5
- Flink Forward 2025 (Barcelona, Spain) October 13-16, CfP open
- Current 2025 (New Orleans, LA, USA) October 29-30, CfP open
New Releases
- Flink CDC 3.4.0
- Apache Kafka 3.9.1
- Strimzi 0.46
- librdkafka 2.10.0
- Debezium 3.2.0.Alpha1
- Kroxylicious 0.12.0
That’s all for this month! We hope you’ve enjoyed the newsletter and we’d love to hear any feedback or suggestions you’ve got.
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)