Moving Data from Apache Pulsar to Apache Iceberg

Apache Pulsar is a distributed messaging and event streaming platform that provides low-latency, durable messaging with strong consistency guarantees. Apache Iceberg is a table format for large-scale data lakes that provides efficient data storage and query performance. Both are Apache projects designed for large-scale data processing, but they serve different purposes: Pulsar focuses on messaging and event streaming, while Iceberg is geared toward data lake management and query optimization.

Apache Pulsar overview

Apache Pulsar is a distributed messaging and event streaming platform that offers high performance and scalability. It provides real-time data processing capabilities, enabling users to publish and subscribe to messages reliably and efficiently.

A primary strength of Apache Pulsar is its ability to handle massive data streams with low latency and high throughput. It supports features such as message deduplication, message replay, and configurable message retention, making it a versatile platform for a variety of use cases.

Apache Pulsar is typically deployed in a distributed architecture, with brokers handling message storage and processing, and clients connecting to these brokers to publish and consume messages. It can be deployed on-premises or in the cloud, offering flexibility and scalability for different applications and workloads.
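As an illustration, the publish/subscribe flow can be sketched with the pulsar-client Python library. The broker URL, topic, and subscription name below are placeholder assumptions, not values from this article:

```python
import json

def encode_event(event: dict) -> bytes:
    """Pulsar message payloads are raw bytes; JSON is a common encoding."""
    return json.dumps(event).encode("utf-8")

def publish_and_consume(service_url: str, topic: str) -> None:
    # pip install pulsar-client; imported lazily so the helper above stays stdlib-only
    import pulsar

    client = pulsar.Client(service_url)

    # Publish one event to the topic.
    producer = client.create_producer(topic)
    producer.send(encode_event({"order_id": 1, "amount": 9.99}))

    # Subscribe and read it back.
    consumer = client.subscribe(topic, subscription_name="demo-sub")
    msg = consumer.receive(timeout_millis=5000)
    print(msg.data())
    consumer.acknowledge(msg)  # ack so the broker can track consumption per retention policy

    client.close()

# Example invocation against a local broker (hypothetical URL and topic):
# publish_and_consume("pulsar://localhost:6650", "persistent://public/default/orders")
```

The commented-out call shows how the sketch would be run against a locally deployed broker.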

Apache Iceberg overview

Apache Iceberg is an open source table format for large-scale data processing that offers features like time travel, schema evolution, and snapshot isolation. It provides an efficient, scalable way to manage tables, making it easier to work with large datasets in a data lake or cloud storage environment.

Apache Iceberg is typically deployed in data lake environments where there is a need to manage large amounts of data efficiently. It can be used with popular data processing frameworks like Apache Spark and Presto, allowing users to easily query and analyze data stored in Iceberg tables. Iceberg tables are designed to be highly performant and reliable, making them a popular choice for organizations looking to streamline their data management processes.
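To make the table-format idea concrete, here is a hedged sketch of reading an Iceberg table with the pyiceberg library. The catalog, namespace, and table names are hypothetical, and the catalog connection details are assumed to come from pyiceberg's standard configuration (e.g. `~/.pyiceberg.yaml`):

```python
def table_identifier(namespace: str, table: str) -> str:
    """Iceberg tables are addressed as '<namespace>.<table>' within a catalog."""
    return f"{namespace}.{table}"

def read_latest_rows(catalog_name: str, namespace: str, table: str):
    # pip install pyiceberg; imported lazily so the helper above stays stdlib-only
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(catalog_name)  # config resolved from pyiceberg settings
    tbl = catalog.load_table(table_identifier(namespace, table))

    # The snapshot log is what makes time travel possible.
    for entry in tbl.history():
        print(entry.snapshot_id, entry.timestamp_ms)

    # Scan the current snapshot into an Arrow table for analysis.
    return tbl.scan().to_arrow()

# Example invocation (hypothetical names):
# read_latest_rows("default", "analytics", "orders")
```

Engines like Apache Spark and Trino read the same table metadata, so data written to Iceberg by one tool remains queryable by the others.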

Create an Apache Pulsar source connector

To create a Source in the Decodable web app, navigate to the Connections page.

  • On the Connections page, click the "New Connection" button in the upper right corner.
  • Select the type (source or sink) and the external system you want to connect to from the list of available connectors.
  • Enter the connectivity configuration and credentials in the dialog that opens and click "Next." What happens next depends on whether the connector supports connecting to multiple streams.

If the chosen connector supports writing to or reading from multiple streams:

  • A dialog appears from which you can configure the mappings between external resources and Decodable streams.
  • Select which resources should be mapped by the connection by checking the corresponding "Sync" checkboxes, and which streams their data should be sent to.
  • Click "Next," choose a name in the following dialog window, and click "Create" or "Create and start."

More details are available in the Decodable documentation.

Create an Apache Iceberg sink connector

To create a sink in the Decodable web app, navigate to the Connections page.

  • On the Connections page, click the "New Connection" button in the upper right corner.
  • Select the type (source or sink) and the external system you want to connect to from the list of available connectors.
  • Enter the connectivity configuration and credentials in the dialog that opens and click "Next." What happens next depends on whether the connector supports connecting to multiple streams.

If the chosen connector supports writing to or reading from multiple streams:

  • A dialog appears from which you can configure the mappings between external resources and Decodable streams.
  • Select which resources should be mapped by the connection by checking the corresponding "Sync" checkboxes, and which sinks their data should be sent to.
  • Click "Next," choose a name in the following dialog window, and click "Create" or "Create and start."

More details are available in the Decodable documentation.

Processing data in real-time with pipelines

A pipeline is a set of data processing instructions written in SQL or expressed as an Apache Flink job. Pipelines can perform a range of processing including simple filtering and column selection, joins for data enrichment, time-based aggregations, and even pattern detection. When you create a pipeline, you define what data to process, how to process it, and where to send that data, in either a SQL query or a JVM-based programming language of your choosing such as Java or Scala. Any data transformation that the Decodable platform performs happens in a pipeline. To configure Decodable to transform streaming data, you can insert a pipeline between streams. As we saw when creating the Apache Iceberg connector above, pipelines aren’t required simply to move or replicate data in real-time.

Replicating data from systems like Apache Pulsar to Apache Iceberg in real-time allows you to make application and service data available for powerful analytics on up-to-date data. It’s equally simple to cleanse data in flight so it’s useful as soon as it lands. In addition to reducing the latency of data availability, this frees up data warehouse resources to focus on critical analytics, ML, and AI use cases.

Create a pipeline between Apache Pulsar and Apache Iceberg streams

As an example, you can use a SQL query to cleanse the data from Apache Pulsar to meet specific compliance requirements or other business needs when it lands in Apache Iceberg. Perform the following steps:

  • Create a new Pipeline.
  • Select the stream from Apache Pulsar as the input stream and click Next.
  • Write a SQL statement to transform the data. Use the form: insert into <output> select … from <input>. Click "Next".
  • Decodable will create a new stream for the cleansed data. Click Create and Next to proceed.
  • Provide a name and description for your pipeline and click Next.
  • Start the pipeline to begin processing data.
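For illustration, a cleanse statement following that form might look like the sketch below. The stream and column names are hypothetical, and the functions shown assume the Flink SQL dialect that Decodable pipelines use:

```sql
-- Hypothetical streams: pulsar_orders (input) and orders_cleansed (output).
INSERT INTO orders_cleansed
SELECT
  order_id,
  amount,
  order_ts,
  REGEXP_REPLACE(email, '(.).*(@.*)', '$1***$2') AS email  -- mask addresses in flight
FROM pulsar_orders
WHERE amount IS NOT NULL  -- drop malformed events before they land
```

A statement like this both reshapes each record and filters out events that would fail downstream compliance checks.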

The new output stream from the pipeline can be written to Apache Iceberg instead of the original stream. You’re now streaming transformed data into Apache Iceberg from Apache Pulsar in real-time.

Learn more about Decodable, a single platform for real-time ETL/ELT and stream processing.

Let's get decoding

Decodable is free. No CC required. Never expires.