I joined Decodable about a year ago as one of the founding engineers. One year on, we’ve built a self-service real-time data platform supporting hundreds of accounts in multiple regions. In this blog post, I’ll share why we built Decodable, the design decisions we made, and the platform architecture.
At Decodable, we firmly believe that real-time streaming is the future of data. However, many people still do not use real-time data processing because current systems are far too difficult to use.
A typical workflow to set up a real-time data pipeline might be:
- Pull a few Docker images from the web that usually include: kafka, flink/spark, zookeeper
- Clone some Github repo that usually has examples written in Java/Python/Scala
- Modify the example to connect to the test data source and/or sink.
- Cross your fingers and hope that everything would just work
When something goes wrong, the issue could be in many places:
- Is there a network communication issue? Are host names and ports used in the example code actually reachable from the container environment?
- Is the example code packaged and deployed correctly?
- Is there enough disk/memory space for the Docker environment? Are those images configured correctly to work with each other?
It turns out that to get started with real-time data, you need deep and specialized knowledge like networking, operating systems, distributed systems, and language skills. Day 2 operations pile on the challenges: coordinating a schema change, testing pipelines, performing incremental deployments without losing data, creating a safe way to share data across teams, monitoring pipeline health and much more.
Our approach is to come up with a system that provides simple APIs with the right semantics and abstraction for a real-time data platform, and shift the complexity to the platform. Rather than worrying about the low-level details, we think it should be possible to think real-time data processing in terms of:
- connections to external data systems
- streams of data records
- pipelines that process data in streams
The diagram below illustrates how connections, streams and pipelines connect:
Streams are a log of schema-compatible data records. The platform enforces schema compatibility when streams, connections and pipelines are attached to each other. With this design, we can detect schema mismatch before running a connection or a pipeline job. The data contract prevents backward incompatible changes and forces users to use a safe schema migration workflow. It also allows for collaboration and re-use, since anyone can inspect a shared stream definition to understand the function of a project.
Connections transport data between a user’s external systems and Decodable. A connection is an instance of a connector - which encapsulates logic about how to connect to a specific type of system. Each connection is defined as either a source (input) or a sink (output). When a connection is activated, it produces records to the stream bound to it (source) or consumes records from the stream bound to it.
Pipelines are defined by SQL statements. Each pipeline is a single query of
INSERT INTO <output_stream> SELECT .. FROM <input_stream> (JOIN <another_input_stream>) ...
A pipeline reads input from one or more streams, performs processing, and outputs to exactly one stream. Pipelines also have a state: active and inactive. A pipeline only starts processing when it's in an active state.
Pipelines are also versioned. Each version is an immutable SQL definition. As an analogy, pipeline versions are a linear history of git commits. The versioned pipelines support user experiences such as automated testing, continuous deployment and deterministic rollback when failure happens. This greatly smooths the deployment process and reduces the operational cost.
We choose SQL for our processing language because:
- SQL is a high-level language that is very expressive for data processing as proven in the database and batch processing space. SQL is declarative - where the users focus on “what” the processing logic is and let the platform figure out “how” to process it.
- We’ve seen the industry has gone through the cycle of SQL → no SQL → SQL. SQL has proven to be a good fit as a data processing language. As mentioned in this paper Dremel: A Decade of Interactive SQL Analysis at Web Scale "… all data platforms have embraced SQLstyle APIs as the predominant way to query and retrieve data…"
- SQL is already widely known and used in the data engineering community to do data enrichment, filtering, masking, transformation, routing, aggregation and more.
Other choices like Python, Java, or Scala would be more flexible for users to configure performance (eg: specifying parallelism per operator) but harder to get started.
We started with small APIs and are willing to give up sophisticated command and control. This decision allows us to focus on refining the API semantics and user experience early while addressing the 80% of the common use cases. Going forward, the APIs will become more flexible to support more use cases.
Like SQL, our APIs are also declarative. The users only need to declare the intended state (eg: target_state=RUNNING), and the platform will figure out how to make it happen with actual_state=RUNNING. This is achieved through the protocol between the control plane and data plane in the Platform Architecture section below. At a high level, the user declared states are stored persistently in a database, and our data processing services reconcile themselves to match the states in the database.
Since we are building a self-service data platform, we also have the following design goals:
- Reliable and robust: Data processing must be correct, resilient to failures (network partitions, external system outages, etc) and provide the delivery guarantee configured.
- Separation of concerns: The control plane should not touch data. The processing jobs in the data plane should be strictly isolated (cross-account access is impossible) per account.
- Resource usage & throughput: Optimize for performance and efficiency
- Account isolation, backup and recovery: a strong guarantee for data security, recovery & minimized blast radius.
Below is an architecture diagram of our current platform.
We separate our services into two categories - the control plane and the data plane. The services in the control plane do the “intelligent command and control”, and the data plane services do the heavy lifting of processing data.
The Control plane provides APIs to configure, manage and monitor data plane resources. These services are multi-tenant, and they don’t touch customer data for processing. In the case of a partial or total failure of the control plane, the data plane continues to operate.
The Data plane services are usually single-tenant for better tenant isolation since they actively process user data. For example, each pipeline is run with a separate Flink cluster. In case of a failure, the downtime and recovery won’t affect any other pipelines. Services in the data plane regularly report status back to the control plane for monitoring purposes.
The separation of control and data plane allows us to scale different services separately and fine-tune the tenant isolation level. This design also allows us to run the whole data plane in the customer’s environment (eg: VPC) for security/privacy and co-location with data sources and sinks to reduce latency and cloud networking costs.
We also made a few system choices to bootstrap the platform as quickly as possible.
Apache Flink runs Decodable’s pipelines and connections. The open source Flink project is a powerful stateful stream processing engine that is fast, scalable, and robust. Running Flink in production, however, is like taming a beast. Our team has spent a lot of time deploying, configuring, and tuning Flink to support our APIs, which we believe are what our users expect. We’ll discuss the challenges and learnings in a future post.
Apache Kafka runs decodable’s streams. Kafka’s separation of producers and consumers, schema support, delivery guarantees, and configurable retention make it a good fit for us. In our platform, Kafka integrates with Flink through Flink’s Kafka connector. Similar to Flink, the main challenge with Kafka is also on the operational side.
Kubernetes is the infrastructure foundation for Decodable, enabling us to deploy and scale the platform. Each control plane and data plane forms a deployment unit that we call a cell. We deploy multiple cells across various cloud regions for redundancy and geographical distribution. Each cell also interacts with many external systems such as AWS S3, Secrets Manager, RDS, Auth0, Stripe, etc. The whole infrastructure design would be worth a separate post in the future.
It’s been almost 6 months since welcoming our first beta customers onto Decodable, and our recent GA builds on all their feedback, ideas and requests making Decodable a smooth experience and valuable tool. We’ve got a lot of feature requests so there is a lot more to do (a very good problem to have!). Below are some of the interesting items we are looking into:
- Additional cloud providers: Azure, GCP
- Eco-system expansion: growing our library of connectors
- Refining & integrating across the user and developer experience
- Complex workflows such as schema migrations, state management, and logical changes
- Driving incremental performance and efficiency gains
This is the first post of a technical deep-dive series where we presented a high level view of the platform. Going forward, we will drill down into specific areas: infrastructure design, system management, observability, and more.
Start decoding today:
- Watch a video demo of the Decodable Platform
- Get Started for Free
- Try our quickstart guide
- Join our Slack community
- Read the documentation