Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your editor-in-chief for this edition is Hans-Peter Grahsl. Feel free to send me any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Fraud detection is an often-mentioned use case for Apache Flink. Shriram Ravichandran and Dharmateja Yarlagadda wrote two blog posts showing how to implement Flink jobs to detect suspicious transactions based on basic rules (part 1) or complex pattern matching (part 2); a minimal sketch of the rule-based approach follows this list.
- Yaroslav Tkachenko started an article series reflecting on selected challenges in modern data streaming systems. Using Apache Flink as an example, part one elaborates on various efficiency-related aspects, while part two puts the focus on developer experience.
- My recent article explains how to improve the overall testability of Apache Flink jobs. More specifically, it addresses how to (re)write (existing) Flink jobs in a more modular way by defining job-specific interfaces for the actual processing logic while making source and sink components pluggable; a small sketch of that structure also follows below.
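To give a flavor of the rule-based variant from part 1 of the fraud detection posts, here is a minimal sketch; the event shape, threshold, and in-memory source are made up for illustration, while the original posts use Kafka sources and richer rules:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimpleFraudRuleJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (account id, amount) pairs; a real job would consume these from a Kafka source.
        DataStream<Tuple2<String, Double>> transactions = env.fromElements(
                Tuple2.of("acct-1", 42.0),
                Tuple2.of("acct-1", 15_000.0));

        transactions
                .filter(tx -> tx.f1 > 10_000.0) // basic rule: flag unusually large single transactions
                .print();

        env.execute("simple-fraud-rule");
    }
}
```

And here is a rough illustration of the modular structure from the testability article: the processing logic sits behind a small, job-specific interface, and the source and sink are injected from the outside, so a test can plug in in-memory elements and a printing or collecting sink instead of Kafka connectors. All names are illustrative rather than taken from the article:

```java
import java.util.function.Consumer;
import java.util.function.Function;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ModularJobSketch {

    // Job-specific interface for the actual processing logic, free of any I/O concerns.
    interface ProcessingLogic<IN, OUT> extends Function<DataStream<IN>, DataStream<OUT>> {}

    // Wires a pluggable source, the logic, and a pluggable sink together.
    static <IN, OUT> void assemble(DataStream<IN> source,
                                   ProcessingLogic<IN, OUT> logic,
                                   Consumer<DataStream<OUT>> sink) {
        sink.accept(logic.apply(source));
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Production wiring would pass in Kafka source/sink builders here;
        // the sketch uses in-memory elements and print() to stay self-contained.
        ProcessingLogic<String, Integer> logic = stream -> stream.map(String::length).returns(Types.INT);
        assemble(env.fromElements("a", "bb", "ccc"), logic, DataStream::print);

        env.execute("modular-job-sketch");
    }
}
```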
Event Streaming
- KIP-1150: Diskless Topics was proposed recently, with the primary aim of redirecting replication from broker disks to cloud object storage, thereby significantly reducing the total cost of operating Kafka clusters. Filip Yonov and Josep Prat wrote this article discussing several insightful aspects of the idea.
- Related to KIP-1150, Gunnar Morling blogged about “What If We Could Rebuild Kafka From Scratch?”. In this thought experiment, he comes up with his personal nine-item wishlist for what a potential “Kafka.next” should look like.
- KPipe is a new project providing a functional, high-performance Kafka consumer implementation built on modern Java features: it uses virtual threads, composable message processors, and DslJson. A rough sketch of the underlying idea follows below.
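I have not looked at KPipe's actual API in detail, so the following is explicitly not KPipe code. It is just a small sketch of the combination it builds on - one virtual thread per record plus small, composable processor functions - using the plain Kafka consumer client; broker address and topic name are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class VirtualThreadConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "sketch-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Composable processors: plain functions chained with andThen().
        Function<String, String> pipeline =
                ((Function<String, String>) String::trim).andThen(String::toUpperCase);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor()) {
            consumer.subscribe(List.of("orders")); // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    // Each record is processed on its own virtual thread. Note that with
                    // auto-commit, offsets may be committed before processing finishes;
                    // a production consumer would coordinate commits with completion.
                    workers.submit(() -> System.out.println(pipeline.apply(rec.value())));
                }
            }
        }
    }
}
```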
Data Ecosystem
- Ever wondered what makes Apache Arrow a good choice as a data interchange format for databases and query engines? Find out in this two-part article series by Ian Cook, David Lee, and Matt Topol, who discuss “How the Apache Arrow Format Accelerates Query Result Transfer” and “Data Wants to Be Free: Fast Data Exchange with Apache Arrow”. A tiny code example follows after this list.
- With “How I’d Learn Apache Iceberg (if I Had To Start Over)”, Dunith Danushka shares a 7-week study plan for getting started with the extremely popular Apache Iceberg project.
- Chris Riccomini recently wrote an insightful post on incremental view maintenance (IVM), highlighting its growing adoption in platforms like Materialize, PostgreSQL's pg_ivm extension, and tools such as Epsio or Feldera. He discusses key research contributions, including timely and differential dataflow as well as DBSP, all of which have influenced the development of efficient IVM systems. A toy illustration of the incremental idea also follows below.
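To make the Arrow discussion a bit more tangible, here is a tiny, self-contained example (column names and values are invented) that builds a single columnar record batch and writes it to Arrow's IPC stream format, which is essentially what travels over the wire when results are exchanged this way. On recent JDKs, Arrow's Java memory module may additionally require the --add-opens java.base/java.nio=ALL-UNNAMED flag:

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

public class ArrowBatchSketch {
    public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator();
             IntVector id = new IntVector("id", allocator);
             VarCharVector name = new VarCharVector("name", allocator)) {

            // Populate two columns, forming a record batch with two rows.
            id.allocateNew(2);
            id.set(0, 1);
            id.set(1, 2);
            id.setValueCount(2);

            name.allocateNew(2);
            name.setSafe(0, "alice".getBytes(StandardCharsets.UTF_8));
            name.setSafe(1, "bob".getBytes(StandardCharsets.UTF_8));
            name.setValueCount(2);

            // Serialize the batch to the IPC stream format - no per-row conversion involved.
            VectorSchemaRoot root = VectorSchemaRoot.of(id, name);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
                writer.start();
                writer.writeBatch();
                writer.end();
            }
            System.out.println("IPC stream size: " + out.size() + " bytes");
        }
    }
}
```

And for the IVM post, a deliberately toy illustration of the core idea: rather than recomputing an aggregate view from scratch, every insert or delete is applied as a delta to the materialized result. This is purely conceptual and not how Materialize, pg_ivm, Epsio, or Feldera are implemented:

```java
import java.util.HashMap;
import java.util.Map;

// Incrementally maintained equivalent of "SELECT category, count(*) ... GROUP BY category".
public class IncrementalCountView {

    private final Map<String, Long> counts = new HashMap<>();

    public void onInsert(String category) {
        counts.merge(category, 1L, Long::sum);   // +1 delta
    }

    public void onDelete(String category) {
        counts.merge(category, -1L, Long::sum);  // -1 delta (assumes the row was inserted before)
    }

    public Map<String, Long> view() {
        return counts;
    }

    public static void main(String[] args) {
        IncrementalCountView view = new IncrementalCountView();
        view.onInsert("books");
        view.onInsert("books");
        view.onDelete("books");
        view.onInsert("games");
        System.out.println(view.view()); // resulting counts: books=1, games=1
    }
}
```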
Data Platforms and Architecture
- Vu Trinh provides a concise overview of how Meta modernized their lakehouse across the whole organization, with three primary goals: engineering efficiency, faster innovation, and a better user experience. The TL;DR is that they moved away from fragmented query engines and differing SQL dialects towards a unified execution engine and standard SQL.
- SeungMin Lee concluded a blog series with “Iceberg Operation Journey: Takeaways for DB & Server Logs”. This third article shares the chosen configuration settings, experiences with partitioning strategies, and how to monitor Iceberg-related metrics when loading different types of logs into Iceberg tables.
RDBMS and Change Data Capture
- I recently stumbled upon drawDB, a really convenient in-browser editor for database modeling and SQL generation. Check out the GitHub repo.
- Earlier this year, Daniil Roman shared a nice real-world use case explaining how their team used Debezium and Apache Kafka in a recent data migration project.
- Debezium 3.1.0.Final is absolutely packed with good stuff. Most notably, this release ships Debezium Platform (short video walkthrough), which offers a modern and opinionated way to fully describe data pipelines - source/sink configurations and transformation chains - all running on top of Debezium Server and deployable to Kubernetes with Helm charts.
- Following the general industry trend of easing integration with AI-related workloads, Debezium recently shipped improved vector data type support for sources and added JDBC sink support for writing vector embeddings to MySQL and PostgreSQL. In addition, there is a new Debezium Server sink for Milvus and a dedicated vector embeddings transformation.
- The Apache Flink community published a helpful article on Alibaba’s Cloud blog about building Flink CDC pipelines. The first half provides a good overview before the hands-on part explains how to build a data pipeline from MySQL to Paimon. A minimal source-side sketch follows below.
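As a taste of the source side of such a pipeline, here is a minimal DataStream-based sketch that reads MySQL change events as JSON via the MySQL CDC connector and simply prints them. Connection details are placeholders, the article itself builds the full MySQL-to-Paimon pipeline with Flink CDC's declarative YAML definitions, and note that the connector packages moved from com.ververica.cdc to org.apache.flink.cdc in recent Flink CDC releases:

```java
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MySqlCdcSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; adjust to your MySQL instance.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("localhost")
                .port(3306)
                .databaseList("shop")
                .tableList("shop.orders")
                .username("flink_cdc")
                .password("secret")
                .deserializer(new JsonDebeziumDeserializationSchema()) // emit change events as JSON strings
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000); // the snapshot-to-binlog handoff relies on checkpointing

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mysql-cdc-source")
           .print(); // a real pipeline would write to a Paimon sink instead
        env.execute("mysql-cdc-sketch");
    }
}
```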
Paper of the Month
With “An Empirical Evaluation of Columnar Storage Formats”, Xinyu Zeng et al. provide valuable insights into the design and optimization of columnar storage formats in the context of modern data processing needs. By highlighting the strengths and limitations of Parquet and ORC, the authors offer guidance for developing next-generation storage formats that are better aligned with current hardware capabilities and workload requirements. Their benchmark framework serves as a tool for future evaluations, ensuring that storage formats evolve to meet the demands of diverse and complex data environments.
Events & Call for Papers (CfP)
- DevoxxUK (London, UK) May 7-9
- Current 2025 (London, UK) May 20-21
- Snowflake Summit 2025 (San Francisco, CA, USA) June 2-5
- AI & Big Data Expo (Santa Clara, CA, USA) June 4-5
- Flink Forward 2025 (Barcelona, Spain) Oct 13-16, CfP open
New Releases
- Debezium 3.1.0.Final and 3.1.1.Final
- Apache Iceberg 1.7.2 and 1.9.0
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.