Blog
Blog /

What is Change Data Capture? The Why? and How? of CDC.

Sharon Xie
Decodable

The Basics: CDC

Change Data Capture (CDC) is the idea of capturing changes made to data in a database and then delivering those change records in real-time to a downstream process or system.

Let’s start with a simple demonstration with the database replication process. There is a follower database that replicates the data from the leader database. You can view the changes to the leader database as a stream of record logs. Every time someone successfully writes to the leader database, that is a record in the stream. If you apply those events to the follower database in exactly the same order as the original database committed them, you end up with an exact copy of the database.

But we can do a lot more with the CDC records. The follower doesn’t need to be a follower database. It can also be a cache where you want to keep the most recent users, or a search index that’s updated whenever a new document is added, and many more. The idea is that you can attach different consumers to the CDC records for different use cases and they are completely decoupled from the original database and each other.

How Do I Build A System To Process CDC Records?

The CDC use cases we just reviewed are exciting and can be applied to a wide range of problems.

Now we “just” need to architect a system that can:

  • Continuously extract changes from the database
  • Continuously process the CDC records
  • Continuously issue the actions for the destination system to take

There are definitely lots of open source tools that allow you to build a system to do it. If you start down this road, you'll probably soon design a system that looks similar to this:

  • Debezium allows you to capture changes from various databases and deliver the CDC records to Kafka
  • Kafka can persist those CDC records and allow different consumers to process the events independently. Pulsar or Kinesis are common choices here too.
  • Flink is a powerful distributed stateful data processing engine that processes the CDC events. Other choices usually include Spark and KsqlDB.

Each system would do what’s promised but the overhead of deploying, configuring and operating each system, and making sure they work with each other seamlessly is non-trivial. That’s why Decodable offers it as a service.

Decodable Processes CDC Records As-A-Service

Decodable is a real-time data processing platform that provides first class service for CDC processing. Users simply connect to source databases and process CDC records in real-time without managing the underlying infrastructure.

In Decodable, CDC connectors emit change records, which are stored in change streams. Each change record contains both the type of modification: insertions, updates and deletions, and the modified data. These records are processed with respect to the modifications in pipelines. On the sink side, change streams can connect to connectors that support consuming change records (eg: Postgres Sink).

A typical workflow has three steps (the examples show here have been simplified for the purpose of demonstration):

1. Configure a Decodable’s CDC connector (eg: Postgres CDC, MySQL CDC) and it automatically provisions a change stream with CDC records in it. For example

2. Create a pipeline with the processing logic written in SQL. For example, the SQL query below counts all the orders that are not delivered per user.

The query above also automatically provisions an output change stream `non_delivered_count` that contains CDC records for the sink connectors to consume. Since the pipeline is processed continuously, the result is also updated with each input record. Records in the output stream looks like this:

3. Configure a sink connector that can consume changes to see the result in real-time. For example, if a Postgres sink is configured, the connection issues an action to the Postgres database for each record in the stream. They look like:

Note that the SQL syntax `INSERT .. ON CONFLICT .. DO UPDATE SET ..` is used to make sure the retractions (delete + create) are applied atomically.

Unlock more use cases with CDC

Now we know that Decodable can process CDC records in real-time and power destination systems to always have the up-to-date views. But there is no limit that you can only process one change stream at a time. With Flink’s powerful engine, and the accessible SQL interface, stream joins become much easier and you can join tables from different databases in real time!

Conclusion

At Decodable, we believe that with a general purpose database it is hard to achieve good performance over various use cases at scale. Specialized data systems are usually needed to serve purposes that they are designed for. With CDC and Decodable’s stream processing, you can easily build an architecture that encourages loose coupling, scaling and flexibility. Taking the existing data in your database, processing it separately, and delivering it to the systems that do the job well is a super powerful tool.


You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Learn more:

Join the community Slack

New CDC connectors for MySQL and PostgreSQL: Don’t Let Your Data be a Day Late and a Dollar Short

Decodable’s support for CDC (Change Data Capture) advanced further with the release of source CDC connectors for MySQL and PostgreSQL. Source CDC connectors convert traditional database tables into sources of streaming data for processing and delivery to sink systems via Decodable. This blog explains more about CDC with demos for the CDC connectors and Decodable's change processing capabilities.

Learn more

Flink Deployments At Decodable

Decodable’s platform uses Apache Flink to run our customers’ real-time data processing jobs. This blog post explores how we securely, reliably and efficiently manage the underlying Flink deployments at Decodable in a multi-tenant environment.

Learn more

Streaming Data Pipelines with SQL

Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this Data Engineering Podcast hosted by Tobias Macey, Eric Sammer, discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He explains how Decodable addresses that limitation and lets data engineers build streaming pipelines entirely in SQL.

Learn more

Heading

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.

Learn more
Tags
Pintrest icon in black
CDC
Development
SQL

Start using Decodable today.