October 14, 2022 · 10 min read

What is Change Data Capture?

By Sharon Xie

The Basics of CDC

Change Data Capture (CDC) is the idea of capturing changes made to data in a database and then delivering those change records in real time to a downstream process or system.

Let’s start with a simple example: the database replication process. A follower database replicates the data from a leader database. You can view the changes to the leader database as an ordered stream of log records; every successful write to the leader produces a record in the stream. If you apply those records to the follower database in exactly the same order as the leader committed them, you end up with an exact copy of the database.
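
To make this concrete, here is a hypothetical fragment of such a change log for a users table (the format and field names are simplified for illustration):

    {"op": "insert", "after": {"id": 1, "name": "Ada"}}
    {"op": "update", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada Lovelace"}}
    {"op": "delete", "before": {"id": 1, "name": "Ada Lovelace"}}

Replaying these three records against an empty follower, in exactly this order, reproduces the leader’s final state; applying them in any other order would not.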

But we can do a lot more with the CDC records. The follower doesn’t need to be a follower database. It can also be a cache that keeps the most recent user records, a search index that’s updated whenever a new document is added, or many other systems. The idea is that you can attach different consumers to the CDC records for different use cases, and they are completely decoupled from the original database and from each other.

How Do I Build a System to Process CDC Records?

The CDC use cases we just reviewed are exciting and can be applied to a wide range of problems.

Now we “just” need to architect a system that can:

  • Continuously extract changes from the database
  • Continuously process the CDC records
  • Continuously issue the actions for the destination system to take

There are plenty of open-source tools that allow you to build such a system. If you start down this road, you'll probably soon design a system that looks similar to this:

  • Debezium allows you to capture changes from various databases and deliver the CDC records to Kafka (see the configuration sketch after this list).
  • Kafka can persist those CDC records and allow different consumers to process the events independently. Pulsar and Kinesis are common choices here too.
  • Flink is a powerful distributed, stateful data processing engine that processes the CDC events. Other common choices include Spark and ksqlDB.
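
As a rough sketch, registering a Debezium Postgres source with Kafka Connect involves a JSON configuration along these lines (the property names follow Debezium's Postgres connector; the hostname, credentials, and table names below are placeholders):

    {
      "name": "orders-connector",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "secret",
        "database.dbname": "shop",
        "database.server.name": "shop",
        "table.include.list": "public.orders"
      }
    }

Debezium then publishes the change records for each captured table to its own Kafka topic, which downstream consumers can subscribe to independently.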

Each system does what it promises, but the overhead of deploying, configuring, and operating each one, and making sure they all work together seamlessly, is non-trivial. That’s why Decodable offers CDC processing as a service.

Decodable Processes CDC Records as a Service

Decodable is a real-time data processing platform that provides first-class support for CDC processing. Users simply connect to source databases and process CDC records in real time without managing the underlying infrastructure.

In Decodable, CDC connectors emit change records, which are stored in change streams. Each change record contains both the type of modification (insert, update, or delete) and the modified data. Pipelines process these records according to their modification type. On the sink side, change streams can connect to connectors that support consuming change records (e.g., the Postgres sink).

A typical workflow has three steps (the examples shown here have been simplified for the purpose of demonstration):

1. Configure a Decodable CDC connector (e.g., Postgres CDC or MySQL CDC), and it automatically provisions a change stream with CDC records in it.
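
For example, connecting a Postgres CDC source to a hypothetical orders table might yield change records like these (the format is simplified; real records carry additional metadata):

    {"op": "insert", "after": {"order_id": 1, "user_id": 10, "status": "created"}}
    {"op": "insert", "after": {"order_id": 2, "user_id": 10, "status": "created"}}
    {"op": "update", "before": {"order_id": 1, "user_id": 10, "status": "created"},
                     "after":  {"order_id": 1, "user_id": 10, "status": "delivered"}}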

2. Create a pipeline with the processing logic written in SQL. For example, the SQL query below counts, for each user, the orders that have not yet been delivered.
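
A sketch of such a pipeline query, assuming the hypothetical orders stream and columns from step 1:

    INSERT INTO non_delivered_count
    SELECT user_id, COUNT(*) AS non_delivered
    FROM orders
    WHERE status <> 'delivered'
    GROUP BY user_id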

The query above also automatically provisions an output change stream, non_delivered_count, that contains CDC records for the sink connectors to consume. Since the pipeline runs continuously, the result is updated with each input record.
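
Continuing the hypothetical example, suppose one of user 10's two undelivered orders is marked delivered. The records in the output stream would then look like this (format simplified):

    {"op": "insert", "after": {"user_id": 10, "non_delivered": 2}}
    {"op": "delete", "before": {"user_id": 10, "non_delivered": 2}}
    {"op": "insert", "after": {"user_id": 10, "non_delivered": 1}}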

3. Configure a sink connector that can consume changes to see the result in real time. For example, if a Postgres sink is configured, the connector issues an action to the Postgres database for each record in the stream.
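
For the hypothetical output records above, the sink would issue statements along these lines:

    INSERT INTO non_delivered_count (user_id, non_delivered)
    VALUES (10, 1)
    ON CONFLICT (user_id)
    DO UPDATE SET non_delivered = EXCLUDED.non_delivered;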

Note that the SQL syntax INSERT ... ON CONFLICT ... DO UPDATE SET ... is used to make sure retractions (a delete followed by a create) are applied atomically.

Unlock More Use Cases with CDC

Now we know that Decodable can process CDC records in real time and keep destination systems supplied with up-to-date views. But you are not limited to processing one change stream at a time. With Flink’s powerful engine and the accessible SQL interface, stream joins become much easier: you can join tables from different databases in real time, as the sketch below shows.
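
For instance, a pipeline along these lines (all stream and column names are hypothetical) could continuously join an orders stream captured from MySQL with a users stream captured from Postgres:

    INSERT INTO enriched_orders
    SELECT o.order_id, o.status, u.email
    FROM orders o
    JOIN users u ON o.user_id = u.user_id

Whenever either source table changes, the joined result downstream is updated accordingly.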

Conclusion

At Decodable, we believe it is hard for a general-purpose database to achieve good performance across varied use cases at scale. Specialized data systems are usually needed, each serving the purpose it was designed for. With CDC and Decodable’s stream processing, you can easily build an architecture that encourages loose coupling, scalability, and flexibility. Taking the existing data in your database, processing it separately, and delivering it to the systems that do each job well is a powerful pattern.


You can get started with Decodable for free: our developer account includes enough for you to build a useful pipeline and, unlike a trial, it never expires.

Learn more:

Join the community Slack

Sharon Xie

Sharon is a founding engineer at Decodable. Currently she leads product management and development. She has over six years of experience in building and operating streaming data platforms, with extensive expertise in Apache Kafka, Apache Flink, and Debezium. Before joining Decodable, she served as the technical lead for the real-time data platform at Splunk, where her focus was on the streaming query language and developer SDKs.
