June 16, 2025
5 min read

Operating Flink Is Hard: What does this really mean? And how to go about it?

By
Sharon Xie

You may well have heard phrases like "Operating Flink is tough" or "If you think operating Kafka is hard, wait till you meet Flink."

After over seven years of operating data platforms backed by Flink, I can confirm that Flink is infamous for being tricky. But before you dismiss it outright, let me break down where the complexity comes from and ways to turn that perceived headache into a manageable routine.

Note that this post focuses exclusively on Flink’s streaming mode, which is its most common deployment and the area with the most difficult operational challenges.

Flink SQL != Database Query

Apache Flink has come a long way toward making stream processing accessible. Flink SQL, in particular, lets engineers focus on business logic instead of wrestling with low-level APIs. But that simplicity can breed confusion: many people find out, often too late, that running a Flink SQL job is not the same as running an ad-hoc database query.

Unlike a typical database query, which runs over a finite dataset and produces a result, a Flink query is a continuously running, stateful data processing job. Its performance and stability depend on factors that can change at any moment: the rate and shape of incoming data, the size of the job's state, and the resources allocated to it.
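
To make this concrete, here is a minimal PyFlink sketch, assuming a Kafka topic named "orders" and the Kafka SQL connector jar available to the job (both are illustrative assumptions, not part of any specific setup). The aggregation at the end never terminates the way a database query does; it keeps per-customer state and emits updated results for as long as the job runs.

```python
# Minimal sketch: a streaming SQL query is a long-running, stateful job.
# Assumes a local Kafka broker, an "orders" topic, and the
# flink-sql-connector-kafka jar on the classpath.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE orders (
        customer_id STRING,
        amount      DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Continuously running, stateful aggregation -- not a one-shot query.
# This keeps printing changelog updates until the job is cancelled.
t_env.execute_sql(
    "SELECT customer_id, SUM(amount) AS total_spend "
    "FROM orders GROUP BY customer_id"
).print()
```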

Treat Flink Jobs Well … Just Like Microservices

The quote below resonated with a lot of stream processing enthusiasts, and it says a lot about how powerful stream processing is.

However, in our context here, I want to point out that the reverse is also true!

Running a Flink job is similar to running an application service: both are long-running and driven by incoming activity. A microservice is a server process that exposes APIs to accept and handle incoming requests. A Flink stream processing job is likewise constantly available to process incoming data whenever it arrives. Hence, both are expected to provide predictable latency and reliability guarantees.

So if you stop at "Is my Flink job producing the right results?" you risk being blindsided by resource exhaustion, backpressure, or unexpected failures once your job runs at scale. In practice, you must think beyond unit tests and local runs. Similar to rolling out a microservice, you need to go through capacity planning, performance testing, and monitoring before going live. Let’s take a closer look.

Capacity Planning and Performance Testing

In an environment that mirrors production, measure latency, throughput, and checkpoint behavior. Simulate load spikes or data skew to see how your Flink job reacts. Are checkpoints completing reliably and on time? Is state growth under control, with state size staying within expected bounds? While you are at it, take the chance to adjust parallelism, tune RocksDB settings, or optimize joins.
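
As one example of what such a test can watch, here is a hedged sketch that polls Flink's REST API for checkpoint statistics while a load test runs. The endpoint is standard Flink; the job ID placeholder, sampling interval, and 60-second duration threshold are illustrative assumptions.

```python
# Hedged sketch: observe checkpoint behavior under load via Flink's REST API.
import time
import requests

FLINK_REST = "http://localhost:8081"   # default Flink REST port
JOB_ID = "<your-job-id>"               # hypothetical placeholder

def checkpoint_snapshot():
    resp = requests.get(f"{FLINK_REST}/jobs/{JOB_ID}/checkpoints", timeout=10)
    resp.raise_for_status()
    stats = resp.json()
    latest = stats["latest"].get("completed") or {}
    return {
        "completed": stats["counts"]["completed"],
        "failed": stats["counts"]["failed"],
        "latest_duration_ms": latest.get("end_to_end_duration"),
        "latest_state_size_bytes": latest.get("state_size"),
    }

for _ in range(10):                    # sample a few times during the test run
    snap = checkpoint_snapshot()
    print(snap)
    if snap["latest_duration_ms"] and snap["latest_duration_ms"] > 60_000:
        print("WARNING: checkpoint duration exceeded 60 s under load")
    time.sleep(30)
```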

Production Rollout and Monitoring

Once the performance test metrics look solid, it's time to make sure you have health indicators and proper monitors in place—including but not limited to checkpoint frequency, backlog, CPU / memory utilization, and backpressure signals—so you’re not flying blind when running Flink jobs in production.

Treating your Flink jobs like microservices, and bringing familiar operational patterns such as integration tests, canary deployments, shadow jobs before upgrades, and rolling restarts to stream processing, dramatically reduces surprises.

Choosing And Owning the Right Metrics

Even with a service-oriented mindset, picking the right metrics and alert thresholds for Flink jobs is notoriously difficult. When a Flink job fails, the errors could originate from multiple layers owned by different parties, for example:

  • If the source database the Flink job is reading from has an outage, the team that owns the database should be responsible.
  • If the job itself didn’t retry on intermittent failure, the team that writes the Flink job should make a code change.
  • If the Kubernetes cluster that runs all the Flink jobs loses network connectivity (someone may have changed a security policy, for example), the team that owns this cluster should look into it.

While team structures and responsibilities differ among companies, and some teams or individuals may wear multiple hats, there are usually three roles involved in the Flink ecosystem. Flink platform engineers provide the infrastructure and runtime for all Flink jobs. Flink application developers build the data processing jobs. Finally, data infrastructure owners manage the various source and sink systems that Flink jobs read from and write to.

Since engineers don’t like being paged for issues they are not responsible for, alerts should be built on carefully chosen metrics and properly tuned monitors. Just as in a microservices architecture, each team involved in the Flink ecosystem operates its own "services" with clear boundaries, uptime SLAs, and dedicated health checks.

Let’s break down how each team can monitor their services.

Flink Platform Engineers

Their goal is to provide a stable, observable Flink runtime that meets pre-defined uptime SLAs. Rather than alerting on individual job failures, they should focus on platform-level health, such as monitoring Kubernetes cluster resource utilization, control plane service availability and errors, and fleet-wide Flink job status.
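
For instance, a fleet-level health signal can be derived from Flink's /jobs/overview REST endpoint, as in the hedged sketch below (it assumes a session cluster exposing the standard REST API; the 5% alert threshold is an illustrative assumption, not a recommendation from the original setup).

```python
# Hedged sketch: summarize fleet-wide job status from /jobs/overview.
from collections import Counter
import requests

FLINK_REST = "http://localhost:8081"

overview = requests.get(f"{FLINK_REST}/jobs/overview", timeout=10).json()
states = Counter(job["state"] for job in overview["jobs"])

print(dict(states))                          # e.g. {'RUNNING': 120, 'FAILING': 2}
unhealthy = states["FAILING"] + states["RESTARTING"]
if unhealthy / max(sum(states.values()), 1) > 0.05:   # assumed threshold
    print("ALERT: more than 5% of the fleet is failing or restarting")
```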

Here at Decodable, we run a platform backed by Flink, and we are obligated to provide an uptime SLA for Flink jobs. Admittedly, providing an uptime SLA for Flink jobs that can run custom code and configurations isn’t straightforward. Specifically, you don’t want the uptime to drop because of non-platform-related issues. We defined the uptime percentage for Flink jobs running in our platform as follows:

The percentage of time a production job can process data without external interruption* from outside the Flink environment. In other words, “uptime” begins when a job has successfully submitted at least one checkpoint and received sufficient CPU and memory to run (under 80% resource utilization).

*external interruption: Any event that causes a job to stop processing outside its runtime. For example, a manual user cancellation, an external system shutdown, or a revoked access token.

Flink already reports uptime-related metrics for each job, so we started with:

  • uptime = sum(runningTimeTotal)
  • downtime = sum(restartingTimeTotal + failingTimeTotal + cancellingTimeTotal)
  • uptimePercentage = uptime / (uptime + downtime)

Obviously, this doesn’t account for failures caused by external systems. To handle those, you need to exclude a job from the reporting group when its failure is caused by an external system. In our case, error-handling logic stops a job when it reaches a state that requires external intervention, which automatically excludes it from the monitoring group.
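
As a rough sketch of that calculation: the metric names below mirror Flink's job-status metrics, how you collect them (Prometheus, the REST API, or otherwise) depends on your setup, and the external-failure flag is an assumed field set by the error-handling logic described above.

```python
# Hedged sketch of the uptime-percentage calculation described in the text.
from dataclasses import dataclass

@dataclass
class JobMetrics:
    running_time_total_ms: int       # runningTimeTotal
    restarting_time_total_ms: int    # restartingTimeTotal
    failing_time_total_ms: int       # failingTimeTotal
    cancelling_time_total_ms: int    # cancellingTimeTotal
    stopped_for_external_reason: bool = False  # assumed flag, set by error handling

def uptime_percentage(jobs: list[JobMetrics]) -> float:
    # Jobs stopped because of external systems are excluded from the group.
    included = [j for j in jobs if not j.stopped_for_external_reason]
    uptime = sum(j.running_time_total_ms for j in included)
    downtime = sum(
        j.restarting_time_total_ms + j.failing_time_total_ms + j.cancelling_time_total_ms
        for j in included
    )
    return 100.0 if uptime + downtime == 0 else 100.0 * uptime / (uptime + downtime)
```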

Flink Application Developers

The Flink application developers own core business logic, error handling, and application-level performance. Monitoring should focus on each job’s liveness and performance, typically grouped into these categories:

Health Indicators

The most reliable application-level “heartbeat” is checkpointing progress. For example, if a job is configured to checkpoint every 30 seconds but suddenly stops completing checkpoints successfully, that is a strong indicator of an unhealthy job.
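
A hedged sketch of such a heartbeat check, using the acknowledgment timestamp of the most recent completed checkpoint from Flink's REST API; the job ID placeholder and the 3x-interval staleness threshold are illustrative assumptions.

```python
# Hedged sketch: alert when no checkpoint has completed within a multiple
# of the configured checkpoint interval.
import time
import requests

FLINK_REST = "http://localhost:8081"
JOB_ID = "<your-job-id>"                 # hypothetical placeholder
CHECKPOINT_INTERVAL_S = 30               # the job's configured interval
STALENESS_MULTIPLIER = 3                 # tolerate a couple of slow checkpoints

stats = requests.get(f"{FLINK_REST}/jobs/{JOB_ID}/checkpoints", timeout=10).json()
latest_completed = stats["latest"].get("completed")

if latest_completed is None:
    print("ALERT: job has never completed a checkpoint")
else:
    age_s = time.time() - latest_completed["latest_ack_timestamp"] / 1000.0
    if age_s > CHECKPOINT_INTERVAL_S * STALENESS_MULTIPLIER:
        print(f"ALERT: last successful checkpoint is {age_s:.0f}s old")
```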

Performance

We suggest monitoring the job backlog (measured by how far behind the source offsets the job is) whenever possible. Streaming platforms (Kafka, Pub/Sub, Pulsar, Kinesis) and CDC-capable databases usually expose this metric. A consistently growing backlog means the job cannot keep up with the input data rate, which in turn causes end-to-end latency to increase.
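
For a Kafka source, one way to approximate this backlog is to compare the consumer group's committed offsets with the latest offsets per partition. The sketch below uses kafka-python and assumes the Flink Kafka source commits offsets on checkpoints under a known group.id; the topic and group names are placeholders.

```python
# Hedged sketch: backlog ~= sum over partitions of (latest offset - committed offset).
from kafka import KafkaConsumer, TopicPartition

TOPIC = "orders"                 # placeholder topic
GROUP_ID = "flink-orders-job"    # placeholder consumer group used by the Flink source

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id=GROUP_ID,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0
    total_lag += max(end_offsets[tp] - committed, 0)

print(f"total backlog for {TOPIC}: {total_lag} records")
consumer.close()
```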

In addition, you should track backpressure levels, state size, and checkpoint duration. Rising values in any of these metrics reveal degrading job performance.

Efficiency

Monitor CPU, memory, and disk utilization of all running pods. In our experience, keeping these metrics below roughly 80 percent is essential for cluster stability.

External Data Systems Owners

These teams should publish their own availability metrics, for example whether a database can accept connections or a Kafka broker is healthy. The exact calculation is technology-specific, and there is plenty of documentation available for each system.

With clear team boundaries and explicit ownership, each team can instrument its services with concrete metrics. This approach helps minimize noisy alerts and accelerate incident response.

Wrapping Up

Yes, operating Flink is hard, but it’s doable. By adopting a service-oriented mindset with the right level of engineering discipline, enforcing clear team boundaries and responsibilities, and obsessively monitoring the right metrics, you’ll find a path to managing the complexity with confidence. Additionally, using a fully managed Flink offering such as Decodable alleviates many of these operational challenges.

After all, Flink has so much value to offer for real-time data processing, and I hope the intricacies of operating it don’t make you shy away from adopting it.

Sharon Xie

Sharon is a founding engineer at Decodable. Currently she leads product management and development. She has over six years of experience in building and operating streaming data platforms, with extensive expertise in Apache Kafka, Apache Flink, and Debezium. Before joining Decodable, she served as the technical lead for the real-time data platform at Splunk, where her focus was on the streaming query language and developer SDKs.
