Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- If you’ve always been curious about what watermarks are in Apache Flink, here is your chance to learn about them in almost no time by reading Robin Moffatt’s article.
- Flink SQL Runner is a curated set of tools and extensions to help run Apache Flink SQL applications on top of the Flink Kubernetes Operator.
- Tributary is a DuckDB extension, addressing data engineers and analysts alike, which provides a seamless integration between Apache Kafka and DuckDB for real-time querying and analysis of streaming data using SQL.
Event Streaming
- Strimzi is not only a well-known and widely deployed project in the Kafka space, it even has its own virtual conference. In case you missed StrimziCon 2025, here is a quick session recap with links to all recordings and slides.
- Kroxylicious, the snappy open source proxy for Apache Kafka, recently shipped version 0.13.0 which includes an operator to run the proxy on Kubernetes.
- Apache Avro is a popular serialization format for Kafka records. To remove some friction when working with Avro data in the context of CLI tooling, Dale Lane open sourced kafka-avro-formatters.
- Even though Apache Kafka has been around for several years, there is still demand for beginners’ content. Here is one such article recently written by Vu Trinh.
- In “Kafka: The End of the Beginning”, Chris Riccomini reflects on the last decade in the event streaming and stream processing spaces. Unsurprisingly, there are a few spicy takes in there which might justify bringing popcorn ;-)
Data Ecosystem
- MLflow 3 - released in early June - extends MLflow’s foundation to address the challenging requirements of generative AI workloads, in particular how to measure and ensure quality and stability.
- Apache Iceberg enthusiasts have been eagerly awaiting this moment… the ratification of the v3 table spec. Read more in this blog article by Danica Fine and Kevin Liu, who briefly walk you through the main features and share what this community-driven effort means going forward.
- Fluss - the streaming storage for real-time analytics - recently released version 0.7. Besides the announcement post, there is a webinar recording to dive deeper into all its new features and improvements.
Data Platforms and Architecture
- Kumudini Kakwani et al. published an article explaining Uber’s migration journey from Hive to Apache Spark SQL. Besides some impressive numbers, it contains several helpful insights into how they successfully tackled various difficult challenges along the way.
- “Model Once, Represent Everywhere: UDA (Unified Data Architecture)” written by Alex Hutter et al. details how Netflix automatically transpiles their domain models into consistent schemas to preserve integrity and interoperability across federated data systems.
Databases and Change Data Capture
- In “The Art of SQL Query Optimization” Jan Nidzwetzki introduces Plan Explorer, a tool which provides valuable insights into the workings of Postgres query optimizations.
- Bohan Zhang’s PGConf.dev 2025 talk “OpenAI: Scaling PostgreSQL to the Next Level” discusses vital aspects and interesting techniques OpenAI uses to meet its reliability and scalability needs for critical workloads.
- Niko Matsakis and Marc Bowes offer insights into Amazon’s DSQL development and why Rust turned out to be a great fit for them. Read the details in “Just make it scale: An Aurora DSQL story”.
- CockroachDB includes changefeeds as a native database feature. Rohan Joshi and Miles Frankel wrote “Enriched Changefeeds: Debezium Simplicity, CockroachDB Resilience” to explain why they decided to adopt Debezium’s change event stream format.
- Snyk created Skemium, an open source tool which helps to detect breaking schema changes of CDC events as early as possible. It compares successive versions of the originating database schema and identifies compatibility issues by executing the schema comparison logic implemented by the schema registry.
- Fiore Mario Vitale discusses how Debezium natively integrates with OpenLineage to help answer critical data lineage related questions. The article also touches upon Marquez as an example of how to process and work with lineage data in the context of CDC.
Paper of the Month
The research paper by Alexander Behm et al. describes Photon, a vectorized query engine built for Lakehouse systems. Photon is implemented in C++ and tightly integrates with Apache Spark APIs to support both SQL and DataFrame-based workloads. It tackles two core challenges: performance over raw, uncurated datasets and semantic compatibility. Photon delivers average query speedups of ~3x (up to 10x) compared to legacy Spark runtimes, and enabled a 100 TB TPC‑DS world record on a Delta‑Lake/S3 Lakehouse.
Events & Call for Papers (CfP)
- BEAM Summit 2025 (New York City, USA) July 8-9
- JavaZone 2025 (Lillestrøm, Norway) September 4-5
- BigDataLdn 2025 (London, United Kingdom) September 24-25
- Devoxx (Antwerp, Belgium) October 6-10, CfP open
- Flink Forward 2025 (Barcelona, Spain) October 13-16
- Current 2025 (New Orleans, LA, USA) October 29-30
New Releases
- Apache Flink Kubernetes Operator 1.12.0
- Debezium 3.1.3.Final and 3.2.0.CR1
- Apache Iceberg 1.9.1
- Apache Pulsar 4.0.5, 3.3.7, and 3.0.12
- Fluss 0.7
- Strimzi 0.46.1
- Kroxylicious 0.13.0
—
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.
Hans-Peter (LinkedIn / Bluesky / X / Mastodon / Email)