August 10, 2023
6 min read

Understanding the Apache Flink Journey

Apache Flink is one of the most widely used data processing platforms due to its true streaming semantics, high throughput, and low latency capabilities. Flink offers several benefits for building stream processing applications, including:

  • A scalable architecture that can process very large volumes of data in real time, scaling horizontally by adding machines to the cluster.
  • A highly performant design that processes data streams with low latency and high throughput, supporting real-time applications.
  • A robust distributed architecture that recovers from failures and ensures data is processed correctly.
  • A wide range of APIs and integrations with other systems, making it easy to build and deploy a variety of big data applications.
  • Stateful processing that lets developers build applications which maintain state and perform calculations over a period of time, making Flink suitable for complex, event-driven applications.

Data at Rest vs. Data in Motion

Many application developers are used to building applications atop databases like PostgreSQL, where their data is at rest. Streaming applications, by contrast, are built on data that is in motion. Building your own Flink application in an imperative language like Java requires a deep and thorough understanding of streaming concepts, whose semantics differ markedly from those of data-at-rest applications, which tend to follow batch processing semantics. For example, when you aggregate or join streaming data, you're required to provide a window: a time scope defined by a start and end time that is always moving forward. In contrast, batch processing does not require windows, since the entire data set is processed at once.
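To make the windowing idea concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) window aggregation in plain Python. This is an illustration of the concept only, not Flink's actual API; the event shape, key names, and 60-second window size are all hypothetical choices for the example.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # window length in seconds (hypothetical choice)

def tumbling_window_counts(events, window_size=WINDOW_SIZE):
    """Group timestamped events into fixed, non-overlapping windows and
    count events per (key, window). Events are (timestamp, key) pairs."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Each event falls into exactly one window, identified by its start time.
        window_start = (timestamp // window_size) * window_size
        counts[(key, window_start)] += 1
    return dict(counts)

# Clicks at t=5s, 30s, and 70s for user "a"; the first two share the 0-60s window.
events = [(5, "a"), (30, "a"), (70, "a")]
print(tumbling_window_counts(events))
# → {("a", 0): 2, ("a", 60): 1}
```

In a real Flink job the engine assigns windows and fires results as event time advances via watermarks; the point here is simply that every aggregation over an unbounded stream needs a bounded time scope like this to produce a result at all.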

For most teams, their Apache Flink journey begins with core streaming use cases like fraud detection, real-time logistics/inventory, transaction processing, and Customer 360 analytics. They first use tools that they are familiar with, and then often bump up against those tools’ limitations. Here we provide a short recap of what comes next, as well as offer some recommendations grounded in having been through this journey with dozens of similarly situated teams.

Once the team has identified that there is a strong need for real-time stateful stream processing, they evaluate available options for their core engine. Teams most frequently discover Flink is the best fit for the same reasons that Netflix, Uber, Lyft, Airbnb, and others have chosen Flink — chiefly performance, scalability, correctness, and reliability.

At this point, they outline additional specific requirements for a platform built around Flink:

  • Easy to use for internal stakeholders
  • Doesn't overload team with operational burden
  • Uses resources wisely
  • Integrates with everything (interoperable)

If data scientists can copy and paste SQL from Snowflake or dbt and it just works, that would be magical.

Stages of Implementation

Once the team downloads and begins the journey of implementing Flink, they start to experience its complexity. In our experience, teams adopting Flink go through these stages:

Stages of Apache Flink Implementation

Stage 3 or 4 is where many firms either face blocking challenges or become mired in troubleshooting. Firms that have multiple people on their team with previous Flink experience have a much higher chance of exiting these early stages successfully. Stage 5, however, requires a concerted investment by the executive team to properly staff and support the Flink deployment; this is the stage where Flink must be turned into a platform. It is important to appreciate that Flink works differently from technologies like Kafka Streams or Apache Beam, and platform engineers may find that their previous knowledge and experience with those technologies fails them when adapting to Flink.

To be successful in creating, maintaining, and gaining ongoing business value out of a production-ready stream processing platform, senior leadership in each department must win the hearts and minds of data scientists, software engineers, data engineers, and other users of the platform. We have seen that comprehensive training for everyone involved is a key to success. Unfortunately, in our survey of over 100 companies we have rarely seen firms succeed with Flink based on training alone—the skill gap is often too large. It’s critical to note that to garner this buy-in and achieve the desired results, your engineering team will be asked to develop key user experience improvements. The platform must meet your users’ needs at their current skill level.

All told, most significant Flink deployments cost an estimated $1.5 to $2.5 million to operate, increasing each year as the platform matures and drives a corresponding increase in usage and complexity. Most of that cost is talent and labor.

The Decodable Advantage

Decodable offers an opportunity to de-risk the entire Apache Flink deployment project, accelerate time to value from 12+ months down to 4 months, free talented employees for more business-focused tasks, and avoid $1 to $2 million per year in costs. Our fully managed Platform as a Service is designed with ease of use at its core and drastically reduces TCO and operational burden. Adopting Decodable gives you an immediate win that delights your software engineers, data engineers, and stakeholders.

Additional suggestions from Decodable’s team of Flink experts:

  1. Interview users on their UX requirements and build that into your project plan. Do not expect users to be successful with generic Flink. Focus especially on the developer experience of building new pipelines, which will largely determine whether the platform experiences successful adoption.
  2. Consider how you will handle hybrid mode for backfill/baseline, as well as how to handle schema evolution of stateful jobs.
  3. Evaluate rigorously whether to support imperative code or Flink SQL. Supporting Flink SQL instantly solves the problem of tuning the job manager, which can become an endless bottleneck if you support arbitrary code.
  4. Evaluate your ability to hire Flink-experienced engineers. Assigning only Flink novices to this undertaking doesn’t set the team up for success and adds unnecessary risk to the project.
  5. Identify “hero customers” and earn sponsorship from their senior leadership to help push through when the project hits obstacles.
  6. Make a list of connectors that will be required and include this development effort in the project plan.
  7. Engage a vendor partner for Flink knowledge transfer, support, and training.
  8. Consider what ongoing operations and support will look like. Employees who gain Flink experience will become very valuable in the talent market and may look to leave, and are unlikely to want to answer support tickets on an ongoing basis.

David Fabritius

