Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling (your editor-in-chief for this edition) and Robin Moffatt. Feel free to send our way any choice nuggets that you think we should feature in future editions.
Stream Processing, Streaming SQL, and Streaming Databases
- Streaming SQL with Apache Flink: A Gentle Introduction Flink SQL exposes Flink’s powerful stream processing capabilities to a large audience of SQL-savvy data engineers. This post by Giannis Polyzos is an excellent introduction to Flink SQL, discussing different kinds of query operators, checkpointing, and more.
- Stream processing becomes mainstream A high-level overview of the current state of stream processing by Javier Redondo, discussing the differences between stream-to-sink and stream-to-table frameworks as well as between libraries and cluster frameworks.
- Building a Fully Managed Apache Flink Service, Behind the Scenes Decodable software engineer Jared Breeden explores in this post what it takes to build a fully managed platform for stream processing based on Flink, touching on aspects like developer experience, observability, schema management, etc.
- What is stateful stream processing? Insightful post by Arroyo’s Micah Wylde, explaining what the “state” in stateful stream processing is about, how systems like Flink and Arroyo deal with it, and how checkpointing ensures consistency after failures.
- Yes, Change Data Capture Still Breaks Database Encapsulation This one is a reaction by Chris Riccomini to my recent article, which in turn was triggered by Chris’ original post, all on the same subject. Overall, I feel Chris and I are not really in much disagreement; it’s more that I described building blocks and implementation techniques, whereas Chris is looking for a ready-made product experience.
- CDC Streaming ELT Framework Flink CDC 3.0, released at the end of last year, introduces a new ELT framework, aiming to simplify the definition of CDC-driven data pipelines. The reference documentation is a great starting point to learn more about this project, which was just adopted into the upstream Apache Flink project earlier this month.
- Sliding window rate limits in distributed systems Naveen Kumar Jakuva Premkumar and Abdullah Al Mamun of Grab discuss in this in-depth post how they rate limit the number of marketing emails and push notifications sent to their users, using two of my favorite data structures: bloom filters and roaring bitmaps.
- Understanding lag in a streaming pipeline New Relic is known as a large-scale user of Apache Kafka. So it’s always interesting to learn about their experiences from running data streaming pipelines. Amy Boyle explains how they identify lag in pipelines using different techniques, as well as their strategies for automatically and dynamically adapting to it.
- An overview of Cloudflare's logging pipeline Cloudflare’s blog is a great source of insightful posts on the technologies they use. In this article, Colin Douch gives an overview of Cloudflare’s logging pipeline, based on Apache Kafka, connecting clusters in multiple data centers using MirrorMaker, and streaming logs to queryable systems such as ClickHouse and Elasticsearch.
- How Mixpanel Built a “Fast Lane” for Our Modern Data Stack A slightly longer read, but definitely worth the time: Illirik Smirnov shares experiences from his work at Mixpanel on creating a near-real-time data pipeline based on Google Cloud Pub/Sub for use cases where their existing (reverse) ETL system doesn’t provide the required latency SLAs.
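For readers new to the idea behind the Grab post above, here is a minimal sketch of a sliding window rate limit. It is not Grab's implementation: their post describes a distributed design built on bloom filters and roaring bitmaps, whereas this sketch swaps in a plain in-memory log of timestamps per user. The class name, method names, and parameters are all hypothetical.

```python
from collections import deque


class SlidingWindowRateLimiter:
    """Allows at most `limit` sends per user within any `window_seconds` span.

    Hypothetical illustration only: state lives in a local dict, not in a
    distributed store.
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._sent = {}  # user id -> deque of send timestamps, oldest first

    def allow(self, user_id, now):
        timestamps = self._sent.setdefault(user_id, deque())
        # Drop sends that have slid out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # sending now would exceed the limit
        timestamps.append(now)
        return True


limiter = SlidingWindowRateLimiter(limit=2, window_seconds=60)
results = [limiter.allow("user-1", t) for t in (0, 10, 20, 61)]
```

The third call is rejected because two sends already fall inside the last 60 seconds; by the fourth call the first send has aged out of the window, so it is allowed again.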
Change Data Capture
- Logical Replication From Postgres 16 Stand-By Servers As of version 16, Postgres supports logical replication from read replicas. In this two-part blog series, I am taking a deep dive into this new functionality, showing how to make use of it in general, how to connect Debezium to read replicas, and how to handle replication slots in case of fail-over scenarios.
- PG Slot Notify: Monitor Postgres Slot Growth in Slack Making sure that Postgres replication slots don’t consume too much disk space is a key concern for every Postgres DBA. Kaushik Iska presents a neat solution to this problem in the form of PG Slot Notify, a chat bot which sends alerts to designated Slack channels in case a slot grows beyond a configured threshold.
- Debezium and TimescaleDB Support for TimescaleDB—a time series database based on Postgres—has been on the wishlist for many Debezium users for quite some time. This has finally become a reality in Debezium 2.5.0. Debezium’s project lead Jiri Pechanec discusses this new feature in this post, including the capability to capture changes to continuously updated aggregates.
- Streamlined Performance: Debezium JDBC connector batch support For the longest time, the Debezium project focused on the source side of real-time data pipelines. This has changed with the recent addition of a JDBC sink connector. Fiore Mario Vitale dives into some performance improvements to this connector.
Data Platforms and Architecture
- Our First Netflix Data Engineering Summit Netflix data engineers shared their learnings around building reliable data pipelines at an internal conference last year. The sessions by Holden Karau and others have been published on YouTube now, including gems such as Streaming SQL on Data Mesh using Apache Flink and Psyberg, An Incremental ETL Framework Using Iceberg.
- Designing A Data-Intensive Future: An Unscripted Journey with Martin Kleppmann Jesse Anderson interviews data legend Martin Kleppmann, talking about the evolution of data systems since Martin’s book “Designing Data-Intensive Applications” came out in 2017, his current work on local-first collaboration software, and his thoughts about being in academia.
- Handling Imbalanced Traffic with Kafka Swimlanes Mixing messages from different sources on one Kafka topic can cause a delay in processing when there are sudden spikes in the messages coming in from one producer. Angus Gibbs of HubSpot discusses how they address that issue, for instance routing backfill traffic and real-time traffic to different topics.
- API-First Approach to Kafka Topic Creation Managing Kafka topics in a consistent, secure, and reliable manner, ideally as self-service for developers, is an evergreen topic for data platform teams. Varun Chakravarthy et al. describe in this post how they've solved this problem at DoorDash by means of infrastructure-as-code using Pulumi.
- Kafka on Kubernetes: Reloaded for fault tolerance The Strimzi operator is a very popular way to run Apache Kafka on Kubernetes. Fabrice Harbulot and Thang Le of Grab discuss how they have designed their Kafka cluster deployments with a strong focus on fault tolerance, leveraging Strimzi’s rolling deployment mechanism and EBS volumes for persisting the data from their Kafka topics.
- Can Event-Driven Architecture make Software Design Easier? An episode of Kris Jenkins’ “Developer Voices” podcast, talking about everything events: event systems, Event Sourcing, the CloudEvents specification, and more. And it being a podcast with Kris, of course there was a mention of Clojure too.
- Using Server Sent Events to Simplify Real-time Streaming at Scale Every year, Shopify ships their Black Friday Cyber Monday (BFCM) live map, a real-time visualization of Shopify sales. In this article, Bao Nguyen discusses the latest version of this map, built using Server-Sent Events (SSE) and Apache Flink for processing raw merchant sales data coming in via Apache Kafka.
- 1️⃣🐝🏎️🦆 (1BRC in SQL with DuckDB) Robin takes a stab at the One Billion Row Challenge in this post—and staying true to his SQL personality, he’s using DuckDB for it. Unsurprisingly, everyone’s favorite embeddable analytics database does really well.
- How Apple built iCloud to store billions of databases Leonardo Creed takes a look under the hood of CloudKit, Apple’s backend service for iCloud, and how they leverage FoundationDB and Apache Cassandra, managing hundreds of petabytes of data.
- Kubernetes for Data Engineers While most data engineers don’t need to work directly with Kubernetes in their jobs (you can decide whether that’s good or bad ;), it can still be interesting to learn about the core ideas and concepts. Daniel Beach provides a nice introduction to Kubernetes in this post.
- Are you using Protocol Buffers as serialization format with Kafka? I asked this question on Twit…, mh, X, the other day, and I was really surprised by the large number of folks who shared their experiences from using ProtoBuf. It seems it’s way more popular than I had thought, with some advantages over Avro, including better support for languages other than Java, a less verbose format for defining schemas, support for partial deserialization, and others.
- Super-fast deduplication of large datasets using Splink and DuckDB Deduplicating data is a common task for most data engineers. Splink is an open-source Python tool designed for this task, and Robin Linacre puts it into action in this post together with DuckDB, deduplicating a dataset of seven million rows, comparing runtimes on EC2 instances of different sizes.
Paper of the Month
📄 DBLog: A Watermark Based Change-Data-Capture Framework (arXiv:2010.12597)
In this paper from 2020, Andreas Andreakis and Ioannis Papapanagiotou, back then working on a CDC solution at Netflix, propose an innovative algorithm for running backfills of existing data (“snapshotting”) concurrently with reading changes from the transaction log, leveraging a windowed de-duplication approach. Adopted by Debezium, Flink CDC, and other CDC solutions, this has been a massive improvement for data engineering teams running CDC pipelines.
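To make the windowed de-duplication idea a bit more tangible, here is a heavily simplified sketch of its core step. It assumes that a chunk of rows was selected from the table between writing a low and a high watermark, and that both watermarks show up as events in the transaction log. Rows in the chunk whose keys were changed inside that window are dropped, since the log already carries a newer version. The function name and the tuple-based event format are made up for illustration; they are not taken from the paper or from any CDC tool.

```python
def merge_chunk_with_log(log_events, chunk, low_wm, high_wm):
    """Interleave one snapshot chunk with a stream of log events.

    log_events: iterable of ("watermark", id) or ("change", key, value)
    chunk:      dict of primary key -> row, read between the two watermarks
    Returns the ordered list of events to emit downstream.
    """
    output = []
    remaining = dict(chunk)  # chunk rows still eligible for emission
    in_window = False
    for event in log_events:
        if event[0] == "watermark":
            if event[1] == low_wm:
                in_window = True
            elif event[1] == high_wm:
                in_window = False
                # Emit surviving chunk rows before any later log events.
                output.extend(("snapshot", k, v) for k, v in remaining.items())
                remaining = {}
            continue
        _, key, value = event
        if in_window:
            # A change inside the window supersedes the chunk's copy of the row.
            remaining.pop(key, None)
        output.append(("change", key, value))
    return output


log = [
    ("change", 1, "a1"),
    ("watermark", "low"),
    ("change", 2, "b2"),
    ("watermark", "high"),
    ("change", 3, "c2"),
]
merged = merge_chunk_with_log(log, {1: "a1", 2: "b1", 3: "c1"}, "low", "high")
```

In this example the stale snapshot row for key 2 is discarded, while the snapshot rows for keys 1 and 3 are emitted right at the high watermark, ahead of the later change to key 3. Log events themselves are never reordered or dropped.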
Events & Call for Papers (CfP)
- GeeCON (Kraków, Poland) May 15-17 (CfP closes Jan 31st)
- JNation (Coimbra, Portugal) June 4-5 (CfP closes Jan 31st)
- JPrime (Sofia, Bulgaria) May 28-29 (CfP closes Feb 15th)
- NDC Oslo (Oslo, Norway) June 10-14 (CfP closes Feb 18th)
- Berlin Buzzwords (Berlin, Germany) June 9-11 (CfP closes Feb 25th)
- Current '24 | The Next Generation of Kafka Summit (Austin, TX) Sep 17-18 (CfP closes Feb 26th)
That’s all for this month! We hope you’ve enjoyed the newsletter and would love to hear about any feedback or suggestions you’ve got.