Top 6 Patterns for Ingesting Data Into Druid

Most people fold their laundry before putting it in the drawer, for the simple reason that once an item is in the container, it is constrained by the limits of that container. Similarly, users ingesting data into Druid will find it far preferable to pre-process the data first. Here are the top 6 reasons to pre-process your streams:

1. Denormalize Whenever You Can

The Apache Druid site states “Denormalize if possible.” Using flat schemas in Druid improves performance significantly by avoiding the need for JOINs at query time. It’s a common practice for data to go through a denormalization and unnesting step before ingestion into Druid.
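
To make the unnesting step concrete, here is a minimal Python sketch of flattening a nested event into a single-level record before it reaches Druid. The `flatten` helper, the underscore separator, and the sample order event are all assumptions made for illustration; a stream processor would normally do this for you.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Collapse nested dicts into a single-level dict with joined key names."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse so that arbitrarily deep nesting ends up as flat columns.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested event, as it might arrive on a stream.
event = {"order_id": 17, "customer": {"id": 42, "address": {"city": "Austin"}}}
flat_event = flatten(event)
print(flat_event)
```

Each nested field becomes its own top-level column (for example `customer_address_city`), which is the shape a flat Druid schema expects.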

2. Once and Only Once

Druid’s append-only segment architecture means that once segments are built, they are committed and published as immutable files. This can present a real problem for data that needs to be updated or that arrives out of order. It’s common for users to rely on stream processing to enforce exactly-once, correctly ordered data, and also to filter out data that is clearly malformed or incorrect.
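
As a rough illustration of the filtering and ordering part, the Python sketch below drops malformed events from a small batch and sorts the rest by event time. The field names `id` and `ts`, and the rule for what counts as well-formed, are assumptions for this example. It does not provide exactly-once guarantees, which come from the stream processor itself.

```python
from datetime import datetime


def is_well_formed(event: dict) -> bool:
    """Accept only events with a parseable ISO 8601 'ts' and a non-empty 'id'."""
    try:
        datetime.fromisoformat(event["ts"])
    except (KeyError, TypeError, ValueError):
        return False
    return bool(event.get("id"))


def clean_and_order(batch: list[dict]) -> list[dict]:
    """Drop malformed events, then sort the survivors by event time."""
    good = [e for e in batch if is_well_formed(e)]
    return sorted(good, key=lambda e: datetime.fromisoformat(e["ts"]))


# Hypothetical batch: two valid events out of order, two malformed ones.
batch = [
    {"id": "b", "ts": "2022-05-03T10:00:05"},
    {"id": "a", "ts": "2022-05-03T10:00:01"},
    {"id": "c", "ts": "not-a-time"},
    {"ts": "2022-05-03T10:00:03"},
]
ordered = clean_and_order(batch)
print([e["id"] for e in ordered])
```

The sketch assumes all timestamps use the same convention (here, no time zone offset); mixing offset-aware and naive timestamps would need to be normalized first.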

3. Join First

Stream-table joins are commonly performed beforehand in order to reduce the need for joins/lookups in Druid. There are a few reasons for this:

From the Druid site: “If you need to join two large distributed tables with each other, you must do this before loading the data into Druid. Druid does not support query-time joins of two datasources.”
Druid’s recent addition of joins is still maturing, and Druid SQL joins don’t have feature parity with ANSI SQL joins. They can also be labor-intensive to implement.
Druid currently supports only broadcast joins, so users cannot join two large tables, as one of them needs to fit into the memory of a single server.
Performance degradation: joins can result in 3x higher query latency compared to using a denormalized data set.
Database joins are an expensive operation that requires data to be shuffled around. If you can pre-join your data in the stream processor, you'll end up with a single, pristine table in Druid.
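
Here is a small Python sketch of what a stream-table join does conceptually: each event is enriched from a reference table before it is written out, so the result is already one flat record. The `PRODUCTS` table, the `enrich` function, and the field names are hypothetical, and a real stream processor would keep the table up to date from its own source.

```python
from typing import Iterable, Iterator

# Hypothetical reference table keyed by product_id.
PRODUCTS = {
    "p1": {"product_name": "kettle", "category": "kitchen"},
    "p2": {"product_name": "lamp", "category": "lighting"},
}


def enrich(events: Iterable[dict], table: dict[str, dict]) -> Iterator[dict]:
    """Left-join each event against the table on product_id."""
    for event in events:
        # Events with no match pass through unchanged (left-join semantics).
        dims = table.get(event["product_id"], {})
        yield {**event, **dims}


clicks = [{"product_id": "p1", "qty": 2}, {"product_id": "p9", "qty": 1}]
joined = list(enrich(clicks, PRODUCTS))
print(joined)
```

Because the lookup happens before ingestion, queries against the resulting datasource need no join at all.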

4. General Cleanup

General required pre-processing:

Druid datasources do not have primary or unique keys, so consider deduplicating records by key, or filtering duplicates out, to save storage and memory. Duplicate records can occur in many situations, including CDC scenarios (even when using Debezium). If your data store doesn't support upserts, you should implement your own deduplication process.
Flatten data (Druid does not support nested data).
Modeling time series requires a significant amount of prework. Please see https://druid.apache.org/docs/latest/ingestion/schema-design.html
Decide which numeric columns should be converted to string columns. The choice has performance tradeoffs, and column types must be specified beforehand.
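
The deduplication step above can be sketched in a few lines of Python. The key fields (`order_id`, `version`) and the sample change records are assumptions for illustration. Note that this version remembers every key forever, whereas a stream processor would typically expire that state after some time.

```python
from typing import Iterable, Iterator


def deduplicate(events: Iterable[dict], key_fields: tuple[str, ...]) -> Iterator[dict]:
    """Yield only the first event seen for each combination of key fields."""
    seen: set[tuple] = set()
    for event in events:
        key = tuple(event[field] for field in key_fields)
        if key in seen:
            continue  # duplicate delivery, e.g. a replayed CDC record
        seen.add(key)
        yield event


# Hypothetical change records, with one exact redelivery.
changes = [
    {"order_id": 1, "version": 1, "status": "new"},
    {"order_id": 1, "version": 1, "status": "new"},
    {"order_id": 1, "version": 2, "status": "paid"},
]
unique = list(deduplicate(changes, ("order_id", "version")))
print(len(unique))
```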

5. Enrich & Aggregate

Basic Enrichment & Aggregation: One of the main reasons that Flink is still the most popular stream processor is its ability to statefully manage tens of terabytes of data in-memory in an efficient and fault-tolerant manner. Pre-processing with Decodable lets you do things like keep track of the number of items in inventory, decrement that number when someone adds an item to their cart, and send the updated number to Druid for a dashboard like “Number of items in-cart where inventory is <100.”

Another common pattern: use Flink to lower the resolution of the dataset by calculating results over an interval and sending only those results to Druid (e.g. the number of in-bound and out-bound flights to/from a specific airport over 15 minutes). This eliminates unnecessary complexity, speeds up queries, and reduces storage and compute costs.
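
To show what lowering the resolution means in practice, here is a minimal Python sketch of the flight example: events are bucketed into 15-minute tumbling windows and counted per airport and direction. It is plain batch Python rather than Flink, and the field names and sample flights are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW_MINUTES = 15


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    return ts - timedelta(
        minutes=ts.minute % WINDOW_MINUTES,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def count_flights(flights: list[dict]) -> list[dict]:
    """Count flights per (window, airport, direction)."""
    counts: Counter = Counter()
    for flight in flights:
        start = window_start(datetime.fromisoformat(flight["ts"]))
        counts[(start.isoformat(), flight["airport"], flight["direction"])] += 1
    return [
        {"window_start": w, "airport": a, "direction": d, "flights": n}
        for (w, a, d), n in sorted(counts.items())
    ]


# Hypothetical raw events; only the aggregated rows would be sent to Druid.
flights = [
    {"ts": "2022-05-03T10:02:00", "airport": "SFO", "direction": "inbound"},
    {"ts": "2022-05-03T10:14:59", "airport": "SFO", "direction": "inbound"},
    {"ts": "2022-05-03T10:15:00", "airport": "SFO", "direction": "inbound"},
    {"ts": "2022-05-03T10:20:00", "airport": "SFO", "direction": "outbound"},
]
rows = count_flights(flights)
for row in rows:
    print(row)
```

Four raw events become three aggregated rows here; on real traffic the reduction is usually far larger, which is where the storage and query savings come from.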

6. Add a Timestamp

Druid requires one attribute to be identified as the timestamp column. If your data does not have a timestamp column, or if its event time column isn’t suitable for this purpose, you can add one with stream processing. You may also need a secondary timestamp.
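
A minimal Python sketch of that step is shown below: if an event has no usable timestamp, the time it was processed is stamped onto it. The field name `timestamp` and the `add_timestamp` helper are assumptions for illustration, and using processing time as a stand-in for event time is a design choice that may not suit every use case.

```python
from datetime import datetime, timezone
from typing import Optional


def add_timestamp(event: dict, field: str = "timestamp",
                  now: Optional[datetime] = None) -> dict:
    """Return a copy of the event that is guaranteed to carry a timestamp."""
    if event.get(field):
        return dict(event)  # keep the event's own timestamp when present
    stamp = now or datetime.now(timezone.utc)
    return {**event, field: stamp.isoformat()}


# A fixed clock keeps the example deterministic.
fixed = datetime(2022, 5, 3, 10, 0, tzinfo=timezone.utc)
stamped = add_timestamp({"sensor": "s1", "value": 3.2}, now=fixed)
print(stamped)
```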

If you are a Druid user and you have a streaming platform (Kafka, Kinesis, Pulsar, Redpanda, etc.), you can log into Decodable.co and easily connect and transform your data today. If you need help, set up a free session with one of our experts and we’ll assist in creating your pipeline.

Other Resources:

https://druid.apache.org/docs/latest/ingestion/schema-design.html

https://imply.io/blog/druid-nails-cost-efficiency-challenge-against-clickhouse-and-rockset/

https://www.baeldung.com/apache-druid-event-driven-data

You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Decodable Team
