
Ingesting Covid Data Into Apache Druid

Hubert Dulay

Apache Druid

Apache Druid is a real-time analytics database designed for fast OLAP queries on large data sets. Druid powers use cases where real-time and streaming ingestion, fast query performance, and high uptime are important.

In this blog, we will ingest COVID-19 global statistics into Apache Druid, cleansing the data so that Druid can easily work with it. Once Druid has the COVID data, we will create a real-time dashboard in Apache Superset, an open source business intelligence tool. Real-time OLAP databases such as Druid operate faster on pre-processed datasets, which reduce their workload and let them focus on query execution. Decodable is the ideal solution for performing this pre-processing.

Solution Architecture

In this example, a Python script gathers COVID statistics, batching them up before sending them to Decodable’s REST API. Decodable then performs a set of cleansing tasks to get the data into a format that is optimized for Apache Druid.

The script runs every hour to fetch updates and send them to Decodable. All of these settings can be externalized and passed into the script. The code for the Python script follows:
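As a sketch of what that script might look like (the COVID API URL, the Decodable endpoint shape, and the field names here are assumptions, not the exact code from this post):

```python
# Hypothetical sketch of the collection script described in this post. The
# source API URL, endpoint format, and payload shape are assumptions.
import json
import time
import urllib.request

COVID_SUMMARY_URL = "https://api.covid19api.com/summary"  # assumed source API
DECODABLE_ENDPOINT = "{{ REST_ENDPOINT_URL }}"            # from the Decodable CLI or UI
DECODABLE_TOKEN = "{{ API_TOKEN }}"

def batch(records, size=100):
    """Split records into fixed-size batches before sending."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def fetch_stats():
    """Pull the per-country COVID statistics from the public API."""
    with urllib.request.urlopen(COVID_SUMMARY_URL, timeout=30) as resp:
        return json.load(resp)["Countries"]

def send_batch(events):
    """POST one batch of raw events to the Decodable REST connection."""
    req = urllib.request.Request(
        DECODABLE_ENDPOINT,
        data=json.dumps({"events": events}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {DECODABLE_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req, timeout=30)

if __name__ == "__main__":
    while True:  # the post notes the script runs every hour
        for events in batch(fetch_stats()):
            send_batch(events)
        time.sleep(3600)
```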

The Python module is called decodablepy, which you can install with pip using the command below.

$ pip install decodablepy

To run the script, execute the command below:
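A sketch of the invocation (the script name and argument order are placeholders; the post’s actual command resolves the endpoint value with the Decodable CLI):

```shell
python covid_ingest.py {{ API_TOKEN }} {{ ACCOUNT_NAME }} {{ REST_ENDPOINT_URL }}
```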

This script takes three command-line arguments: a token for the Decodable REST API, your Decodable account name, and the endpoint of the REST API connection you deployed. The command above pulls the endpoint value using the Decodable CLI. If you don’t have the Decodable CLI available, you can copy the endpoint from the Decodable UI.

Decodable SQL Pipeline

The Decodable pipeline does the work of cleansing the data for Druid. In the SQL below, we select fields from the raw COVID data, filtering out the fields that are not needed. We then use Decodable’s built-in temporal function TO_TIMESTAMP(). This function takes the column containing the timestamp data as its first argument; the second parameter tells the function how to parse the string into a timestamp. The function changes the data type from string to timestamp.
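A sketch of what the pipeline SQL might look like (the stream and column names here are assumptions based on a typical COVID summary feed, not the post’s exact SQL):

```sql
-- Hypothetical Decodable pipeline; stream names, column names, and the
-- timestamp format string are assumptions.
INSERT INTO covid_cleansed
SELECT
  Country,
  CountryCode,
  NewConfirmed,
  TotalConfirmed,
  NewDeaths,
  TotalDeaths,
  -- Parse the string timestamp into a TIMESTAMP using the supplied format
  TO_TIMESTAMP(`Date`, 'yyyy-MM-dd''T''HH:mm:ss.SSS''Z''') AS `Date`
FROM covid_raw
```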

Configuring Druid

Decodable uses Apache Kafka to feed the cleansed data to Apache Druid. We build a connection in Decodable and write our cleansed data into it. We then configure Apache Druid to read from that topic in Kafka.

To configure Druid, we need to complete the configuration below.
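A consumer-properties sketch for a SASL-secured Kafka endpoint might look like the following; the exact security settings depend on how your Kafka connection is configured, and all values are placeholders:

```json
{
  "bootstrap.servers": "{{ BOOTSTRAP SERVERS }}",
  "security.protocol": "SASL_SSL",
  "sasl.mechanism": "PLAIN",
  "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{{ USERNAME }}\" password=\"{{ PASSWORD }}\";"
}
```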

You will need to replace all of the parameters in curly brackets {{ }} with the values that correspond to your setup. This configuration will be used in Apache Druid. From the top of the Druid UI, click on Load data.

Click on “Start a new spec”. Then click on the Apache Kafka tile.

Next paste the configuration you created into the “consumer properties” box and enter the topic name Decodable is writing into.

Click on “Apply”, then the Next button. Follow the wizard to create a datasource, accepting all default values until the wizard is complete.

At the top of the Druid UI, click on “query”. You should see your table on the left side. To test, run the Druid query below.
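A quick smoke-test query might look like this, assuming your datasource was named covid_cleansed after the Kafka topic (both the datasource and column names here are placeholders):

```sql
SELECT __time, "Country", "TotalConfirmed", "TotalDeaths"
FROM "covid_cleansed"
ORDER BY __time DESC
LIMIT 10
```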

You should see the most recent cleansed COVID rows in the results panel.

Apache Superset Configuration

Use the URL below to point Superset to your Druid instance.

druid://{{ USERNAME }}:{{ PASSWORD }}@{{ DRUID HOST }}:{{ DRUID PORT }}/druid/v2/sql

Proceed to create a Dataset from Apache Druid, choosing the COVID data that Decodable ingested. Next, create a World Map chart. Choose the column that contains the CountryCode as the Country Column, and select “Code ISO 3166-1 alpha-2 (cca2)” as the Country Field Type. This allows Superset to map each country to its corresponding statistics.

To show only the statistics for the last day, you can filter the data by applying this WHERE clause in the Filters input:
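One way to express “the last day” in Druid SQL, assuming the cleansed timestamp was ingested as Druid’s __time column, is:

```sql
__time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
```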

Update and save the chart, then assign it to an existing dashboard or create a new one. Your dashboard should look similar to the image below.


In this post, we captured COVID-19 statistics and persisted them to Apache Druid, a real-time OLAP (RTOLAP) database. Decodable pre-processed the records into a flat format and cleansed them to reduce the workload for Apache Druid, so that it can focus on serving real-time data with low latency. If you have any questions or would like help implementing this solution, please contact us.

Video Demo

You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

