🧪 Virtual Hands-On Lab: Introduction to Real-time ETL

March 11, 2022

min read

Decodable Platform Overview

Share this post

I joined Decodable about a year ago as one of the founding engineers. One year on, we’ve built a self-service real-time data platform supporting hundreds of accounts in multiple regions. In this blog post, I’ll share why we built Decodable, the design decisions we made, and the platform architecture.

Motivation

At Decodable, we firmly believe that real-time streaming is the future of data. However, many people still do not use real-time data processing because current systems are far too difficult to use.

A typical workflow to set up a real-time data pipeline might be:

Pull a few Docker images from the web that usually include: kafka, flink/spark, zookeeper
Clone some Github repo that usually has examples written in Java/Python/Scala
Modify the example to connect to the test data source and/or sink.
Cross your fingers and hope that everything would just work

When something goes wrong, the issue could be in many places:

Is there a network communication issue? Are host names and ports used in the example code actually reachable from the container environment?
Is the example code packaged and deployed correctly?
Is there enough disk/memory space for the Docker environment? Are those images configured correctly to work with each other?

It turns out that to get started with real-time data, you need deep and specialized knowledge like networking, operating systems, distributed systems, and language skills. Day 2 operations pile on the challenges: coordinating a schema change, testing pipelines, performing incremental deployments without losing data, creating a safe way to share data across teams, monitoring pipeline health and much more.

Design Approach

Our approach is to come up with a system that provides simple APIs with the right semantics and abstraction for a real-time data platform, and shift the complexity to the platform. Rather than worrying about the low-level details, we think it should be possible to think real-time data processing in terms of:

connections to external data systems
streams of data records
pipelines that process data in streams

The diagram below illustrates how connections, streams and pipelines connect:

Semantics

Streams are a log of schema-compatible data records. The platform enforces schema compatibility when streams, connections and pipelines are attached to each other. With this design, we can detect schema mismatch before running a connection or a pipeline job. The data contract prevents backward incompatible changes and forces users to use a safe schema migration workflow. It also allows for collaboration and re-use, since anyone can inspect a shared stream definition to understand the function of a project.

Connections transport data between a user’s external systems and Decodable. A connection is an instance of a connector - which encapsulates logic about how to connect to a specific type of system. Each connection is defined as either a source (input) or a sink (output). When a connection is activated, it produces records to the stream bound to it (source) or consumes records from the stream bound to it.

Pipelines are defined by SQL statements. Each pipeline is a single query of

INSERT INTO <output_stream> SELECT .. FROM <input_stream> (JOIN <another_input_stream>) ...

A pipeline reads input from one or more streams, performs processing, and outputs to exactly one stream. Pipelines also have a state: active and inactive. A pipeline only starts processing when it's in an active state.

Pipelines are also versioned. Each version is an immutable SQL definition. As an analogy, pipeline versions are a linear history of git commits. The versioned pipelines support user experiences such as automated testing, continuous deployment and deterministic rollback when failure happens. This greatly smooths the deployment process and reduces the operational cost.

Design Decisions

SQL

We choose SQL for our processing language because:

SQL is a high-level language that is very expressive for data processing as proven in the database and batch processing space. SQL is declarative - where the users focus on “what” the processing logic is and let the platform figure out “how” to process it.
We’ve seen the industry has gone through the cycle of SQL → no SQL → SQL. SQL has proven to be a good fit as a data processing language. As mentioned in this paper Dremel: A Decade of Interactive SQL Analysis at Web Scale "… all data platforms have embraced SQLstyle APIs as the predominant way to query and retrieve data…"

SQL is already widely known and used in the data engineering community to do data enrichment, filtering, masking, transformation, routing, aggregation and more.

Other choices like Python, Java, or Scala would be more flexible for users to configure performance (eg: specifying parallelism per operator) but harder to get started.

Small APIs

We started with small APIs and are willing to give up sophisticated command and control. This decision allows us to focus on refining the API semantics and user experience early while addressing the 80% of the common use cases. Going forward, the APIs will become more flexible to support more use cases.

Declarative APIs

Like SQL, our APIs are also declarative. The users only need to declare the intended state (eg: target_state=RUNNING), and the platform will figure out how to make it happen with actual_state=RUNNING. This is achieved through the protocol between the control plane and data plane in the Platform Architecture section below. At a high level, the user declared states are stored persistently in a database, and our data processing services reconcile themselves to match the states in the database.

Platform Architecture

Since we are building a self-service data platform, we also have the following design goals:

Reliable and robust: Data processing must be correct, resilient to failures (network partitions, external system outages, etc) and provide the delivery guarantee configured.
Separation of concerns: The control plane should not touch data. The processing jobs in the data plane should be strictly isolated (cross-account access is impossible) per account.
Resource usage & throughput: Optimize for performance and efficiency
Account isolation, backup and recovery: a strong guarantee for data security, recovery & minimized blast radius.

Below is an architecture diagram of our current platform.

We separate our services into two categories - the control plane and the data plane. The services in the control plane do the “intelligent command and control”, and the data plane services do the heavy lifting of processing data.

The Control plane provides APIs to configure, manage and monitor data plane resources. These services are multi-tenant, and they don’t touch customer data for processing. In the case of a partial or total failure of the control plane, the data plane continues to operate.

The Data plane services are usually single-tenant for better tenant isolation since they actively process user data. For example, each pipeline is run with a separate Flink cluster. In case of a failure, the downtime and recovery won’t affect any other pipelines. Services in the data plane regularly report status back to the control plane for monitoring purposes.

The separation of control and data plane allows us to scale different services separately and fine-tune the tenant isolation level. This design also allows us to run the whole data plane in the customer’s environment (eg: VPC) for security/privacy and co-location with data sources and sinks to reduce latency and cloud networking costs.

We also made a few system choices to bootstrap the platform as quickly as possible.

Apache Flink runs Decodable’s pipelines and connections. The open source Flink project is a powerful stateful stream processing engine that is fast, scalable, and robust. Running Flink in production, however, is like taming a beast. Our team has spent a lot of time deploying, configuring, and tuning Flink to support our APIs, which we believe are what our users expect. We’ll discuss the challenges and learnings in a future post.

Apache Kafka runs decodable’s streams. Kafka’s separation of producers and consumers, schema support, delivery guarantees, and configurable retention make it a good fit for us. In our platform, Kafka integrates with Flink through Flink’s Kafka connector. Similar to Flink, the main challenge with Kafka is also on the operational side.

Kubernetes is the infrastructure foundation for Decodable, enabling us to deploy and scale the platform. Each control plane and data plane forms a deployment unit that we call a cell. We deploy multiple cells across various cloud regions for redundancy and geographical distribution. Each cell also interacts with many external systems such as AWS S3, Secrets Manager, RDS, Auth0, Stripe, etc. The whole infrastructure design would be worth a separate post in the future.

What’s Next?

It’s been almost 6 months since welcoming our first beta customers onto Decodable, and our recent GA builds on all their feedback, ideas and requests making Decodable a smooth experience and valuable tool. We’ve got a lot of feature requests so there is a lot more to do (a very good problem to have!). Below are some of the interesting items we are looking into:

Additional cloud providers: Azure, GCP
Eco-system expansion: growing our library of connectors
Refining & integrating across the user and developer experience
Complex workflows such as schema migrations, state management, and logical changes
Driving incremental performance and efficiency gains

This is the first post of a technical deep-dive series where we presented a high level view of the platform. Going forward, we will drill down into specific areas: infrastructure design, system management, observability, and more.

‍

Start decoding today:

Watch a video demo of the Decodable Platform
Get Started for Free
Try our quickstart guide
Join our Slack community
Read the documentation

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

👍 Got it!

Oops! Something went wrong while submitting the form.

Sharon Xie

Sharon is a founding engineer at Decodable. Currently she leads product management and development. She has over six years of experience in building and operating streaming data platforms, with extensive expertise in Apache Kafka, Apache Flink, and Debezium. Before joining Decodable, she served as the technical lead for the real-time data platform at Splunk, where her focus was on the streaming query language and developer SDKs.

May 9, 2022

min read

Flink Deployments At Decodable

Sharon Xie

Powered by Apache Flink and Debezium, Decodable is a real-time data platform that unifies ELT, ETL, and stream processing.

Start Free Talk To An Expert

Heading 2

Motivation

At Decodable, we firmly believe that real-time streaming is the future of data. However, many people still do not use real-time data processing because current systems are far too difficult to use.

A typical workflow to set up a real-time data pipeline might be:

Pull a few Docker images from the web that usually include: kafka, flink/spark, zookeeper
Clone some Github repo that usually has examples written in Java/Python/Scala
Modify the example to connect to the test data source and/or sink.
Cross your fingers and hope that everything would just work

When something goes wrong, the issue could be in many places:

Is there a network communication issue? Are host names and ports used in the example code actually reachable from the container environment?
Is the example code packaged and deployed correctly?
Is there enough disk/memory space for the Docker environment? Are those images configured correctly to work with each other?

Design Approach

connections to external data systems
streams of data records
pipelines that process data in streams

The diagram below illustrates how connections, streams and pipelines connect:

Semantics

Pipelines are defined by SQL statements. Each pipeline is a single query of

INSERT INTO <output_stream> SELECT .. FROM <input_stream> (JOIN <another_input_stream>) ...

Design Decisions

SQL

We choose SQL for our processing language because:

SQL is a high-level language that is very expressive for data processing as proven in the database and batch processing space. SQL is declarative - where the users focus on “what” the processing logic is and let the platform figure out “how” to process it.
We’ve seen the industry has gone through the cycle of SQL → no SQL → SQL. SQL has proven to be a good fit as a data processing language. As mentioned in this paper Dremel: A Decade of Interactive SQL Analysis at Web Scale "… all data platforms have embraced SQLstyle APIs as the predominant way to query and retrieve data…"

SQL is already widely known and used in the data engineering community to do data enrichment, filtering, masking, transformation, routing, aggregation and more.

Other choices like Python, Java, or Scala would be more flexible for users to configure performance (eg: specifying parallelism per operator) but harder to get started.

Small APIs

Declarative APIs

Platform Architecture

Since we are building a self-service data platform, we also have the following design goals:

Reliable and robust: Data processing must be correct, resilient to failures (network partitions, external system outages, etc) and provide the delivery guarantee configured.
Separation of concerns: The control plane should not touch data. The processing jobs in the data plane should be strictly isolated (cross-account access is impossible) per account.
Resource usage & throughput: Optimize for performance and efficiency
Account isolation, backup and recovery: a strong guarantee for data security, recovery & minimized blast radius.

Below is an architecture diagram of our current platform.

We also made a few system choices to bootstrap the platform as quickly as possible.

What’s Next?

Additional cloud providers: Azure, GCP
Eco-system expansion: growing our library of connectors
Refining & integrating across the user and developer experience
Complex workflows such as schema migrations, state management, and logical changes
Driving incremental performance and efficiency gains

‍

Start decoding today:

Watch a video demo of the Decodable Platform
Get Started for Free
Try our quickstart guide
Join our Slack community
Read the documentation

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Simply enter your email address here and we'll send you the next issue as soon as it's published—and nothing else, we promise!

Sharon Xie

Let's get decoding

Decodable is free. No CC required. Never expires.

Start for Free Talk to an Expert Join the Community on Slack

Decodable Platform Overview

Motivation

Design Approach

Semantics

Design Decisions

Platform Architecture

What’s Next?

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Related Posts

Flink Deployments At Decodable

Table of contents

Motivation

Design Approach

Semantics

Design Decisions

Platform Architecture

What’s Next?

📫 Email signup 👇

Did you enjoy this issue of Checkpoint Chronicle? Would you like the next edition delivered directly to your email to read from the comfort of your own home?

Related Posts

Flink Deployments At Decodable

Let's get decoding