
Flink Deployments At Decodable

Sharon Xie
Decodable

Decodable’s platform uses Apache Flink to run our customers’ real-time data processing jobs. In this post, I will explain how we manage the underlying Flink deployments at Decodable in a multi-tenant environment. To see where Apache Flink fits in the overall Decodable platform, check out the Platform Overview blog which outlines our high-level platform architecture.

Flink has many traits that we love for real-time data processing. It is a distributed processing engine designed for stream processing: stateful computations over unbounded streams of data. It is tremendously scalable and is used by Alibaba, Netflix, Uber, Lyft, and a host of other companies of all sizes.

Decodable offers real-time data processing as a self-service platform, and - like every multi-tenant SaaS service - we must provide strict tenant isolation, uptime guarantees and performance SLAs. Let’s break down our requirements and see how we configured and deployed Flink to meet our needs.

Requirement: Low Cost to Create a Free Account

Our goal is to minimize both the time a new user waits for their account to be ready and the cost of the resources behind it. SaaS platform users typically experience delays when creating an initial account because infrastructure dedicated to that account must be provisioned; this can easily take longer than 10 minutes.

As a user, you never want to hear that your account will be deleted because the provider's internal costs for free trial accounts are too high. At Decodable, we believe that our customers will experience the value of our platform and many will want to expand to the pay-as-you-go tier as their usage increases. Obviously, we incur costs from the free tier offering, and want to keep those costs to a minimum.

Solution: On-Demand Flink Resource Creation

Creating a Decodable account takes only seconds. This is largely because we don’t pre-create any Flink resources at account creation time. Flink resources are created only when a data processing job is triggered, either by a preview session or by a running instance of a pipeline or connection. When a job terminates, its resources are cleaned up immediately.

Requirement: Strong Security and Isolation for Production Workloads

Once a data processing job is running in production, it should keep running without issues. This is much easier said than done! In a multi-tenant environment, we must prevent bad actors, mitigate noisy neighbors, and limit the blast radius of any unexpected outages.

Solution: Flink Application Clusters for Pipeline and Connection Jobs

Flink offers several deployment modes for running natively on Kubernetes. Given the requirements here, we use application mode to run our connection and pipeline jobs. Application mode runs one job per cluster, with a JobManager dedicated solely to that job. This isolation avoids noisy neighbors and resource contention. In addition, we configure Kubernetes resource limits on each Flink pod to ensure CPU and memory are shared fairly. We’ve also configured high availability for Flink clusters so they recover automatically if a pod is erroneously killed or during an unexpected incident.
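As a rough sketch, launching a single-job application cluster on Kubernetes with per-pod resource settings and Kubernetes-based high availability can look like the following. The cluster ID, image, storage path, and sizing values here are illustrative placeholders, not our production configuration:

```shell
# Launch a dedicated application-mode cluster for one job.
# Flink derives the pod resource requests/limits from the CPU and
# memory options below; HA state is stored outside the cluster so a
# restarted JobManager can recover the job.
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=pipeline-job-example \
    -Dkubernetes.container.image=example.com/flink-job:latest \
    -Dkubernetes.jobmanager.cpu=1.0 \
    -Dkubernetes.taskmanager.cpu=1.0 \
    -Djobmanager.memory.process.size=1600m \
    -Dtaskmanager.memory.process.size=2048m \
    -Dhigh-availability=kubernetes \
    -Dhigh-availability.storageDir=s3://example-bucket/flink-ha \
    local:///opt/flink/usrlib/pipeline-job.jar
```

Because each job gets its own JobManager and TaskManagers, tearing the job down is simply a matter of deleting that cluster’s Kubernetes resources, which matches the on-demand lifecycle described earlier.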

Requirement: Quick Response to Preview Jobs

Preview is a feature that lets a user test a pipeline by running its SQL statement against real production data. Preview is critical to the pipeline authoring experience because it helps users understand whether their pipeline will do what they expect on real data. In the iterative process of trying and testing, we want to minimize the cycle time.

Solution: Custom Flink Session Clusters for Preview Jobs

We deploy Flink session clusters to run our preview jobs. The main reason is that session mode has a much lower start-up time, because jobs are submitted to a shared, pre-deployed JobManager. However, TaskManagers (the workers) in session mode are designed to be shared, which makes it hard to isolate jobs from different tenants. To ensure job isolation in session mode, we modified Flink so that each TaskManager runs only one job. Preview jobs are short-lived (no more than 2 minutes), and a TaskManager is killed when its job times out. To further reduce preview response time, we also modified Flink to pre-deploy a number of idle TaskManagers. When a TaskManager is claimed, a new one is created immediately, so there is always a pool of idle TaskManagers ready to accept work.
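A minimal sketch of the session-mode flow with stock Flink is below. The cluster ID and sizing are illustrative; note that the one-job-per-TaskManager guarantee and the warm pool of idle TaskManagers described above come from our internal modifications, not from upstream Flink. Configuring a single task slot per TaskManager, as shown, only approximates that isolation:

```shell
# Start a shared, long-lived session cluster (one per environment,
# not one per job). One slot per TaskManager limits how many tasks
# each worker can host.
./bin/kubernetes-session.sh \
    -Dkubernetes.cluster-id=preview-session \
    -Dtaskmanager.numberOfTaskSlots=1 \
    -Dtaskmanager.memory.process.size=1024m

# Submitting a preview job then skips cluster startup entirely: it
# reuses the pre-deployed JobManager, so only scheduling time remains.
./bin/flink run \
    --target kubernetes-session \
    -Dkubernetes.cluster-id=preview-session \
    ./preview-job.jar
```

The trade-off is exactly the one described above: submission is fast because the control plane already exists, but the shared cluster needs extra work to keep tenants apart.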

In summary: Decodable uses Flink session clusters to run fast ephemeral jobs, and application clusters to run production jobs. We provide multi-tenant data isolation guarantees in both scenarios. We also modified Flink to boost its performance and we plan to contribute our changes back to the Flink community.

What’s Next at Decodable?

Running Flink in production has been a long journey, and there is more to do. Some of the interesting items in our backlog are:

  • Ensure network bandwidth is shared fairly among accounts.
  • Optimize performance for jobs with large state while ensuring a fair share of resources (e.g., disk and network usage).
  • Auto-adjust Flink resources based on the workload, potentially extending reactive mode to support native Kubernetes deployments.
  • Figure out an intuitive and secure way to allow users to run custom code.

Get Started with Decodable

You can get started with Decodable for free: our developer account includes enough for you to build a useful pipeline, and, unlike a trial, it never expires.

