Blog
Blog /

Real Time Streaming Joins With SQL

Robert Metzger
Decodable

Decodable enables continuous SQL joins across streaming sources including streaming systems like Kafka, Pulsar and Kinesis as well as databases with Change Data Capture (CDC). This is exactly the use-case for real-time joins. In this blog we'll explore how this works, and how it helps bring siloed data into play in low-latency streaming applications. Whenever a table in your database changes, Decodable instantly updates the join result.

Let’s assume you have a database table with all your customer data, and a table mapping the zip-codes to your retail stores, and your goal is to build a report counting the number of customers which are living close by a store. With a traditional database you would periodically run a SQL query to get that report. With modern streaming platforms like decodable, you are able to run such a query continuously. Whenever you get new or updated customer data, a store data changes, the report will update in real-time.

Combining a stream and a table in another example, you have a website clickstream of events coming from Confluent Cloud (a managed Apache Kafka service) and you want to enrich each stream event by looking up & adding customer information in a PostgreSQL table. You'll use this to tell you the last product page the customer viewed. You'll then store the result back into PostgreSQL, which will always be up to date with the latest page viewed. This is exactly the use case for real-time joins.

Currently, decodable supports running real-time joins with data from MySQL and PostgreSQL databases. We are currently working on adding more databases, such as Oracle, Microsoft SQL Server, Snowflake and more.

Benefits Of Real-Time Joins

What are the benefits of real-time joins compared to traditional joins, executed periodically?

The main benefit is the reduced latency. Historically you'd use a batch mode to execute a join, orchestrated by Apache Airflow or AWS Step Functions triggering a query, say once per hour. Even if the query takes only a few seconds to run, the average latency will be 30 minutes, the long tail latency much higher.

We know from our customers that data is most valuable when fresh. In other words, the higher the latency you have in your processing pipeline, the less value you can extract from it. For example a global logistics company which is already using stream processing has a competitive advantage if it can inform customers in real-time about the locations of their goods, and notify them in case of delays. A traditional "bricks and mortar" retailer can also benefit from bringing online experiences into their stores: real-time promotions, post-checkout surveys or upsells like product insurance.

Another benefit is data integration across many systems: It is often difficult to join data located in different databases - the data is locked in. How do you extract and normalize data coming from a traditional Oracle database, Snowflake or MongoDB? With decodable’s connector ecosystem and stream processing capabilities, this can be done in a matter of minutes.

Real-time joins are also resource efficient and fast, because the joins are calculated based on the change / delta of the data, rather than scanning the entire database.

Creating Real-Time Joins In Decodable

Decodable is a platform for real-time data processing. Data is represented in streams. To get data into a stream, you need to set up connectors. A connector can read data from systems like Kafka or Kinesis. Also, we support connecting to databases such as MySQL or PostgreSQL. In case of a database, the connector will read the transaction log from a database into a decodable stream.

Once you have multiple streams of transaction logs (we call them change streams, because they represent change events in a database), you can set up pipelines to process them. A pipeline is defined by its input streams, an output stream and a SQL query describing, in our example, a join operation:

A pipeline output stream can then be used by other pipelines or by a connector writing the data somewhere, such as MySQL, PostgreSQL or Kafka.

Conclusion

Almost all data is produced in real-time, but typically we capture the data in a static database or data warehouse before processing it. Traditionally, this data has been stored and then processed in batches, mostly due to technical limitations and the perceived challenges of stream processing. However, companies like Uber, Lyft, Amazon, Netflix and Stripe have embraced real-time stream processing for a few years now, mostly using internal stream processing platforms. Decodable removes the technical and skills barriers to adopting streaming data by offering a fully managed serverless stream processing platform, using SQL as the transformation language.

Real-time joins are among the more advanced use-cases for stream processing, as they require a lot of different infrastructure components for getting the data out of databases, and for efficiently executing real-time joins. Decodable eliminates this complexity, making streaming joins available to everyone.

Create a free decodable account today to try out streaming joins yourself!


You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Learn more:

Join the community Slack

New CDC connectors for MySQL and PostgreSQL: Don’t Let Your Data be a Day Late and a Dollar Short

Decodable’s support for CDC (Change Data Capture) advanced further with the release of source CDC connectors for MySQL and PostgreSQL. Source CDC connectors convert traditional database tables into sources of streaming data for processing and delivery to sink systems via Decodable. This blog explains more about CDC with demos for the CDC connectors and Decodable's change processing capabilities.

Learn more

What is Change Data Capture? The Why? and How? of CDC.

Change Data Capture (CDC) is the idea of capturing changes made to data in a database and then delivering those change events in real-time to a downstream process or system. In this blog we'll break down why this is useful and how to build a system to process CDC streams.

Learn more

We’re Abusing The Data Warehouse; RETL, ELT, And Other Weird Stuff.

By now, everyone has seen the rETL (Reverse ETL) trend: you want to use data from app #1 to enrich data in app #2. In this blog, Decodable's founder discusses the (fatal) shortcomings of this approach and how to get the job done.

Learn more

Heading

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.

Learn more
Tags
Pintrest icon in black
CDC
Development
SQL

Start using Decodable today.