Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It lets you publish and subscribe to streams of records, store them in a fault-tolerant way, and process them as they occur. Apache Iceberg, on the other hand, is an open table format for large-scale data processing that provides efficient table scans and updates, along with features such as schema evolution and time travel for managing large datasets.
Both Apache Kafka and Apache Iceberg are popular tools in the data processing ecosystem, but they serve different purposes: Kafka focuses on real-time data streaming and processing and is used for building real-time data pipelines, while Iceberg is geared towards managing and querying large datasets efficiently.
Apache Kafka overview
Apache Kafka is a distributed streaming platform that is designed to handle real-time data feeds and processing. Its primary functions include publishing and subscribing to streams of records, storing streams of records in a fault-tolerant manner, and processing streams of records as they occur.
Apache Kafka is typically deployed in a cluster configuration, where multiple Kafka brokers work together to store and manage data. It is commonly used in scenarios such as real-time analytics, log aggregation, and event sourcing. Kafka can be integrated with various data processing frameworks and tools, making it a versatile solution for handling large amounts of data in real time.
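As an illustration of that integration, here is a minimal sketch of reading a Kafka topic as a stream of rows using Flink SQL's Kafka connector; the topic name, broker address, and columns are hypothetical.

```sql
-- Minimal sketch (assumes Flink SQL with the Kafka connector available);
-- the topic, broker address, and columns are hypothetical.
CREATE TABLE orders_raw (
  order_id   STRING,
  amount     DOUBLE,
  order_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'broker-1:9092',
  'properties.group.id' = 'orders-reader',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Records published to the topic become rows that can be processed as they arrive.
SELECT order_id, amount FROM orders_raw WHERE amount > 100;
```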
Apache Iceberg overview
Apache Iceberg is an open table format for large-scale data storage that provides efficient and reliable data management capabilities. It offers features such as schema evolution, time travel, and efficient data pruning, making it ideal for analytics workloads.
Apache Iceberg is typically used with distributed data processing frameworks such as Apache Spark and Apache Flink, where it can handle large volumes of data with high performance and scalability. It provides a unified table format that allows users to easily query and manage their data across different storage systems, including cloud object storage and the Hadoop Distributed File System (HDFS).
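Here is a minimal sketch of those features using Spark SQL (assuming a recent Spark version with the Iceberg runtime on the classpath and a catalog named demo already configured; the table, columns, and timestamp are hypothetical).

```sql
-- Create an Iceberg table with hidden partitioning on the event time.
CREATE TABLE demo.db.events (
  event_id   STRING,
  event_type STRING,
  event_time TIMESTAMP
) USING iceberg
PARTITIONED BY (days(event_time));

-- Schema evolution: add a column without rewriting existing data files.
ALTER TABLE demo.db.events ADD COLUMNS (source STRING);

-- Time travel: query the table as of an earlier snapshot.
SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00';
```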
Processing data in real-time with pipelines
A pipeline is a set of data processing instructions written in SQL or expressed as an Apache Flink job. Pipelines can perform a range of processing including simple filtering and column selection, joins for data enrichment, time-based aggregations, and even pattern detection. When you create a pipeline, you define what data to process, how to process it, and where to send it, either in a SQL query or in a JVM-based programming language of your choosing, such as Java or Scala. Any data transformation that the Decodable platform performs happens in a pipeline. To configure Decodable to transform streaming data, you insert a pipeline between streams. As we'll see when creating the connectors below, though, pipelines aren't required simply to move or replicate data in real time.
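For example, a simple pipeline that filters records and selects a few columns might look like the following sketch; the stream names and columns are hypothetical.

```sql
-- Minimal sketch of a pipeline doing filtering and column selection;
-- the stream names (http_events, http_events_get) and columns are hypothetical.
INSERT INTO http_events_get
SELECT request_id, user_id, path, response_time_ms
FROM http_events
WHERE method = 'GET';
```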
Replicating data from systems like Apache Kafka to Apache Iceberg in real time makes application and service data available for powerful analytics on up-to-date data. It's equally simple to cleanse data in flight so it's useful as soon as it lands. In addition to reducing the time until data is available, this frees up data warehouse resources to focus on critical analytics, ML, and AI use cases.
Create an Apache Kafka source connector
To create a Source in the Decodable web app, navigate to the Connections page.
- On the Connections page, click the "New Connection" button in the upper right corner.
- Select the type (source or sink) and the external system you want to connect to from the list of available connectors.
- Enter the connectivity configuration and credentials in the dialog that opens and click "Next." What happens next depends on whether the connector supports connecting to multiple streams.
If the chosen connector supports writing to or reading from multiple streams:
- A dialog appears from which you can configure the mappings between external resources and Decodable streams.
- Select which resources should be mapped by the connection by checking the corresponding "Sync" checkboxes, and which sinks their data should be sent to.
- Click "Next," choose a name in the following dialog window, and click "Create" or "Create and start."
More details are available in the Decodable documentation.

Create an Apache Iceberg sink connector
To create a Sink in the Decodable web app, navigate to the Connections page.
- On the Connections page, click the "New Connection" button in the upper right corner.
- Select the type (source or sink) and the external system you want to connect to from the list of available connectors.
- Enter the connectivity configuration and credentials in the dialog that opens and click "Next." What happens next depends on whether the connector supports connecting to multiple streams.
If the chosen connector supports writing to or reading from multiple streams:
- A dialog appears from which you can configure the mappings between external resources and Decodable streams.
- Select which resources should be mapped by the connection by checking the corresponding "Sync" checkboxes, and which sinks their data should be sent to.
- Click "Next," choose a name in the following dialog window, and click "Create" or "Create and start."
More details are available in the Decodable documentation.
Create a pipeline between Apache Kafka and Apache Iceberg streams
As an example, you can use a SQL query to cleanse the data coming from Apache Kafka so that it meets specific compliance requirements or other business needs when it lands in Apache Iceberg (a sketch of such a query follows the steps below). Perform the following steps:
- Create a new Pipeline.
- Select the stream from Apache Kafka as the input stream and click Next.
- Write a SQL statement to transform the data. Use the form: <span class="inline-code">insert into <output> select … from <input></span>. Click "Next".

- Decodable will create a new stream for the cleansed data. Click "Create" and "Next" to proceed.

- Provide a name and description for your pipeline and click "Next".
- Start the pipeline to begin processing data.
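As a sketch of the cleansing query from the steps above: the stream names, columns, and masking rule are hypothetical, and the string functions assume a Flink SQL dialect.

```sql
-- Redact the local part of each email address before the data lands in Apache Iceberg.
INSERT INTO orders_cleansed
SELECT
  order_id,
  amount,
  CONCAT('***@', SPLIT_INDEX(email, '@', 1)) AS email,
  order_time
FROM orders_from_kafka
WHERE amount IS NOT NULL;
```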
The new output stream from the pipeline can be written to Apache Iceberg instead of the original stream. You're now streaming transformed data from Apache Kafka into Apache Iceberg in real time.
Learn more about Decodable, a single platform for real-time ETL/ELT and stream processing.