
Top 6 Patterns for Ingesting Data Into Druid

Riley Johnson

Most people fold their laundry before putting it into the drawers, for the simple reason that once something is in the container, it is constrained by the limits of the container. Similarly, users ingesting data into Druid will find that it’s much preferable to pre-process the data first. Here are the top 6 patterns for pre-processing your streams:

1. Denormalize Whenever You Can

The Apache Druid site states “Denormalize if possible.” Using flat schemas in Druid improves performance significantly by avoiding the need for JOINs at query time. It’s a common practice for data to go through a denormalization and unnesting step before ingestion into Druid.
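The denormalization step above can be sketched in a few lines. This is a minimal illustration, not Decodable or Druid API code; the `orders` event shape and `products` reference table are hypothetical, and in practice this merge would run inside your stream processor before the flat records reach Druid.

```python
# A hypothetical reference ("dimension") table, keyed by product_id.
products = {
    "p1": {"name": "widget", "category": "tools"},
}

def denormalize(event, table):
    """Merge reference attributes into the event, producing one flat record
    so Druid never needs a JOIN at query time."""
    ref = table.get(event["product_id"], {})
    return {**event, **{f"product_{k}": v for k, v in ref.items()}}

flat = denormalize({"order_id": "o1", "product_id": "p1", "qty": 2}, products)
# flat now carries product_name and product_category alongside the order fields
```

The output is a single wide record: the order fields plus `product_name` and `product_category`, ready for a flat Druid schema.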

2. Once and Only Once

Druid’s append-only segment architecture means that once segments are built, they are committed and published in an immutable format. This presents a real problem for data that needs to be updated or arrives out of order. It’s common for users to use stream processing to enforce exactly-once, correctly ordered delivery, and to filter out data that is clearly malformed or incorrect.
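As a minimal sketch of that cleanup, assuming each event carries a unique `event_id` and an `event_time` (both hypothetical field names), duplicates can be dropped and ordering restored before the data ever reaches a segment. A real stream processor would do this incrementally with watermarks rather than sorting a batch:

```python
def clean(events):
    """Drop duplicate event_ids and re-emit events in event-time order."""
    seen = set()
    out = []
    for e in sorted(events, key=lambda e: e["event_time"]):  # restore order
        if e["event_id"] in seen:  # drop duplicate deliveries
            continue
        seen.add(e["event_id"])
        out.append(e)
    return out

events = [
    {"event_id": "a", "event_time": 2},
    {"event_id": "a", "event_time": 2},  # duplicate delivery
    {"event_id": "b", "event_time": 1},  # arrived out of order
]
cleaned = clean(events)
# "b" (t=1) now precedes "a" (t=2), and each appears exactly once
```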

3. Join First

Stream-table joins are commonly performed beforehand in order to reduce the need for joins/lookups in Druid.  There are a few reasons for this:

  • From the Druid site: “If you need to join two large distributed tables with each other, you must do this before loading the data into Druid. Druid does not support query-time joins of two datasources.”
  • Druid’s recently added join support is still maturing; DruidSQL joins do not have feature parity with ANSI SQL joins, and they can be labor-intensive to implement.
  • Druid currently only supports broadcast joins, so two large tables cannot be joined: one side must fit into the memory of a single server.
  • Performance: joins can result in roughly 3x query latency compared to querying a denormalized data set.
  • Database joins are an expensive operation that requires data to be shuffled around. If you pre-join your data in the stream processor, you end up with a single, pristine table in Druid.
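A stream-table join done upstream can be sketched as latest-value-per-key state: the "table" side is kept as a map of the most recent row per key, and each stream event is enriched from it. This is an illustrative pattern, not Decodable API code, and the `user_state`, click, and user-update field names are hypothetical.

```python
# State for the "table" side of the join: latest row per user_id.
user_state = {}

def on_user_update(update):
    """Table side: keep only the most recent row for each key."""
    user_state[update["user_id"]] = update

def on_click(click):
    """Stream side: enrich each event from the current table state."""
    user = user_state.get(click["user_id"], {})
    return {**click, "country": user.get("country", "unknown")}

on_user_update({"user_id": "u1", "country": "DE"})
joined = on_click({"user_id": "u1", "page": "/home"})
# the click event is emitted with the user's country already attached
```

Because the join happens here, Druid receives a single pre-joined record and never needs a query-time lookup.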

4. General Cleanup

General required pre-processing:

  • Druid datasources do not have primary or unique keys, so consider deduplicating on your keys and filtering out duplicates to save storage and memory. Duplicated records can occur in many scenarios, including CDC (even when using Debezium). If your data store doesn't support upsert, you should implement your own de-duplication step.
  • Flatten data (Druid does not support nested data)
  • Modeling time series data requires a significant amount of prework.
  • Convert columns between numeric and string types as needed; each type has performance tradeoffs, and the type must be specified before ingestion.
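The flattening item above can be sketched as a small recursive transform that turns nested records into the flat schemas Druid ingests, joining nested keys with `_`. This is a generic illustration under that naming assumption, not Druid's own flattening spec:

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts, joining key paths with '_'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}_"))  # descend into nesting
        else:
            flat[name] = value
    return flat

flat = flatten({"id": 1, "geo": {"city": "Berlin", "loc": {"lat": 52.5}}})
# nested geo fields become top-level geo_city and geo_loc_lat columns
```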

5. Enrich & Aggregate

Basic Enrichment & Aggregation: One of the main reasons that Flink is still the most popular stream processor is its ability to statefully manage tens of terabytes of data in-memory in an efficient and fault-tolerant manner. Using Decodable to pre-process will allow you to do things like keep track of the number of items in inventory, decrement it when someone adds an item to their cart, and send the updated number to Druid for a dashboard like “Number of items in-cart where inventory is <100.”
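That stateful inventory pattern can be sketched as a per-item running count that is updated by each event and emitted downstream. The event shapes (`restock`, `add_to_cart`) are hypothetical, and a real pipeline would keep this state in the stream processor's fault-tolerant store rather than a plain dict:

```python
from collections import defaultdict

# Per-item running inventory; stands in for managed operator state.
inventory = defaultdict(int)

def process(event):
    """Update the count for one item and emit the new value downstream."""
    if event["type"] == "restock":
        inventory[event["item"]] += event["qty"]
    elif event["type"] == "add_to_cart":
        inventory[event["item"]] -= event["qty"]
    return {"item": event["item"], "inventory": inventory[event["item"]]}

process({"type": "restock", "item": "sku-1", "qty": 100})
update = process({"type": "add_to_cart", "item": "sku-1", "qty": 2})
# the emitted record carries the decremented inventory for the dashboard
```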

Another common pattern: Use Flink to lower the resolution of the dataset by calculating results over an interval and sending that data to Druid (e.g. number of in-bound and out-bound flights to/from a specific airport over 15 minutes). This eliminates unnecessary complexity, speeds up queries, and reduces the cost from storage and compute.
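The resolution-lowering pattern can be sketched as a tumbling-window count: bucket each event into a 15-minute window and emit one row per (airport, direction, window) instead of one row per flight. A batch version of the idea, with hypothetical field names:

```python
from collections import Counter

WINDOW = 15 * 60  # tumbling window size in seconds

def aggregate(events):
    """Count events per (airport, direction, 15-minute window)."""
    counts = Counter()
    for e in events:
        bucket = e["ts"] - e["ts"] % WINDOW  # start of the window
        counts[(e["airport"], e["direction"], bucket)] += 1
    return [{"airport": a, "direction": d, "window_start": w, "flights": n}
            for (a, d, w), n in counts.items()]

rows = aggregate([
    {"airport": "BER", "direction": "in", "ts": 100},
    {"airport": "BER", "direction": "in", "ts": 200},
])
# two raw events collapse into a single pre-aggregated row for Druid
```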

6. Add a Timestamp

Druid requires one attribute to be designated as the timestamp column. If your data does not have a timestamp column, or its event-time column isn’t suitable for this purpose, you can add one with stream processing. You may also require a secondary timestamp.
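A minimal sketch of that stamping step: if an event lacks a usable event time, attach a processing-time timestamp before ingestion so Druid has a column to designate. The field name `ts` is an assumption for illustration:

```python
from datetime import datetime, timezone

def add_timestamp(event):
    """Attach a processing-time timestamp if the event has none."""
    if "ts" not in event:
        event["ts"] = datetime.now(timezone.utc).isoformat()
    return event

stamped = add_timestamp({"user": "u1"})  # gains a "ts" field
kept = add_timestamp({"user": "u2", "ts": "2023-01-01T00:00:00Z"})  # unchanged
```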

If you are a Druid user today and you have a streaming platform (Kafka, Kinesis, Pulsar, Redpanda, etc.), you can log into Decodable and easily connect and transform your data today. If you need help, set up a free session with one of our experts and we’ll assist in creating your pipeline.

Other Resources:

You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Learn more:

We’re Abusing The Data Warehouse; RETL, ELT, And Other Weird Stuff.

By now, everyone has seen the rETL (Reverse ETL) trend: you want to use data from app #1 to enrich data in app #2. In this blog, Decodable's founder discusses the (fatal) shortcomings of this approach and how to get the job done.

Learn more

Ingesting Covid Data Into Apache Druid

Apache Druid is a popular realtime online analytical processing database (RTOLAP). In this blog we'll show how to use Decodable to ingest COVID19 global statistics into Apache Druid for visualization in a dashboard.

Learn more

Decodable Demo @ Netflix

In this video, Decodable's CEO Eric Sammer and founding engineer Sharon Xie walk the real-time data team at Netflix through the Decodable story. Eric sets the scene with a summary of challenges faced by teams looking to adopt real-time data on existing technology platforms, and the rationale for Decodable. Sharon picks up the story with a demo-heavy deep dive into how Decodable works and the art of the possible with this new platform.

Learn more



Start using Decodable today.