
Ingesting OSQuery Logs To Apache Pinot

Hubert Dulay
Decodable

Previously, we demonstrated how to use an osquery extension to send event logs to Apache Pulsar. We then routed, cleansed, and filtered the logs with Decodable using only SQL. In this blog we will feed the cleansed logs into Apache Pinot, a real-time OLAP database. 

Apache Pinot is a real-time online analytical processing database (Real-Time OLAP, or RTOLAP). RTOLAP databases can quickly retrieve, aggregate, and analyze data without running heavy batch processing. Pinot integrates natively with streaming platforms such as Kafka, RedPanda, and Pulsar. We’ll be covering additional RTOLAP databases in future blog posts.

RTOLAP databases operate faster with pre-processed datasets that reduce their workloads and enable more focused query execution. Decodable is the ideal solution for performing this pre-processing.

Solution Architecture

The diagram above shows osquery capturing events on running processes and emitting them to Apache Pulsar. Decodable subscribes to a Pulsar topic and filters out noisy logs, cleansing the data before emitting it to the final destination. Details on how this works are in the previous blog post, “Routing OSQuery Events via Apache Pulsar.”

The final stage involves writing the cleansed osquery logs to Apache Pinot. We will then use Apache Superset (a fast and lightweight BI tool) to display the osquery logs stored in Apache Pinot.

Decodable lets you aggregate data from multiple streaming platforms, including Apache Pulsar, Apache Kafka, RedPanda, and AWS Kinesis, so you can assemble a single streaming data pipeline without needing to replicate your data to Kafka. The diagram below shows an enhanced version of this solution, with osquery logs from laptops going to Apache Pulsar while Docker containers send theirs to Apache Kafka.

Connect Kafka

Apache Pinot reads directly from streaming platforms to create a materialized view. Our demo will generate two sets of logs from Decodable: cleansed logs and cleansed suspicious logs. We’ll feed both sets of logs into Kafka and then to Apache Pinot.

Next, we’ll configure a Pinot table to read the two sets of logs from Kafka.

We start with the cleansed osquery logs. Before Pinot can read anything, we create the connections that feed Kafka: assuming we have a Kafka cluster ready, we need a Decodable sink that writes both sets of logs to Kafka.

Run the following command once for each set of logs, creating the osquery_cleansed_sink and suspicious_osquery_cleansed_sink connections. Change the CONNECTION_NAME, STREAM_NAME, and TOPIC for each run.
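
As a sketch, the sink creation might look like the following Decodable CLI invocation. The exact flags, property names, and value format are assumptions here; check `decodable connection create --help` against your CLI version.

```bash
# Hypothetical sketch: create a Decodable sink connection that drains a
# stream to a Kafka topic. Flag and property names are illustrative.
CONNECTION_NAME=osquery_cleansed_sink   # or: suspicious_osquery_cleansed_sink
STREAM_NAME=osquery_cleansed            # or: suspicious_osquery_cleansed
TOPIC=osquery-cleansed                  # or: suspicious-osquery-cleansed
STREAM_ID=<id>                          # look it up with: decodable stream list

decodable connection create \
  --name "$CONNECTION_NAME" \
  --connector kafka \
  --type sink \
  --stream-id "$STREAM_ID" \
  --prop bootstrap.servers="$BOOTSTRAP" \
  --prop topic="$TOPIC" \
  --prop value.format=avro
  # ...plus whatever security and schema-registry properties your cluster needs
```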

Configuring a Pinot Table

Next up, we’ll configure a Realtime table in Apache Pinot. Realtime tables ingest data from streams (such as Kafka) and build segments from the consumed data. Pinot has two other types of tables: Offline, for batch ingestion, and Hybrid, for both realtime and offline data. By default, all tables in Pinot are Hybrid in nature.

First, define the Pinot schema, derived from the schema in the sink connection. Both osquery feeds need a schema; the two schemas are identical except for their schemaName values.
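
A minimal sketch of such a schema file follows. The field names are assumptions based on typical osquery process-event columns; adapt them to the fields your cleansed stream actually carries.

```bash
# Sketch of a Pinot schema, saved to a file for use with pinot-admin later.
# Field names are assumed; match them to your cleansed osquery stream.
cat > osquery_schema.json <<'EOF'
{
  "schemaName": "osquery",
  "dimensionFieldSpecs": [
    { "name": "hostname", "dataType": "STRING" },
    { "name": "name",     "dataType": "STRING" },
    { "name": "action",   "dataType": "STRING" },
    { "name": "path",     "dataType": "STRING" },
    { "name": "pid",      "dataType": "LONG"   }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "unixTime",
      "dataType": "LONG",
      "format": "1:SECONDS:EPOCH",
      "granularity": "1:SECONDS"
    }
  ]
}
EOF
```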

Below is a Jinja template that generates a Realtime table configuration for Pinot (a sketch follows the parameter list). The template contains parameters that need to be set first.

Parameters:

  • {{SCHEMA}} - the Apache Pinot schemaName.
  • {{TOPIC}} - the topic Apache Pinot will consume from.
  • {{BOOTSTRAP}} - the Kafka bootstrap servers.
  • {{SCHEMA_REGISTRY}} - the schema registry URL.
  • {{CONFLUENT_KEY}} - the Confluent Cloud key to your Kafka cluster.
  • {{CONFLUENT_SECRET}} - the Confluent Cloud secret.
  • {{CONFLUENT_SR_KEY}} - the Confluent Cloud key for schema registry.
  • {{CONFLUENT_SR_SECRET}} - the Confluent Cloud secret for schema registry.
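
Here is a sketch of what the template might contain, saved as a file so it can be rendered in the next step. The stream properties follow Pinot's Kafka consumer and Confluent Schema Registry decoder configuration; the time column and flush threshold are assumptions to adjust for your setup.

```bash
# Hypothetical osquery_table.json.j2 - a Realtime table config template.
cat > osquery_table.json.j2 <<'EOF'
{
  "tableName": "{{SCHEMA}}",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "{{SCHEMA}}",
    "timeColumnName": "unixTime",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "{{TOPIC}}",
      "stream.kafka.broker.list": "{{BOOTSTRAP}}",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
      "stream.kafka.decoder.prop.schema.registry.rest.url": "{{SCHEMA_REGISTRY}}",
      "stream.kafka.decoder.prop.basic.auth.credentials.source": "USER_INFO",
      "stream.kafka.decoder.prop.schema.registry.basic.auth.user.info": "{{CONFLUENT_SR_KEY}}:{{CONFLUENT_SR_SECRET}}",
      "security.protocol": "SASL_SSL",
      "sasl.mechanism": "PLAIN",
      "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{{CONFLUENT_KEY}}\" password=\"{{CONFLUENT_SECRET}}\";",
      "realtime.segment.flush.threshold.time": "24h"
    }
  },
  "metadata": {}
}
EOF
```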

You can use Jinja to populate these parameters in the template and generate the Pinot table configuration. Create an environment variable for each of the Jinja parameters and run the command below twice, once for each topic (osquery logs and suspicious osquery logs), saving the output of each run to use in Apache Pinot.
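
One way to render the template, assuming the Jinja2 Python package is installed (`pip install jinja2`) and the parameter values shown are placeholders for your own:

```bash
# Set one environment variable per Jinja parameter (values are placeholders).
export SCHEMA=osquery TOPIC=osquery-cleansed
export BOOTSTRAP=... SCHEMA_REGISTRY=...
export CONFLUENT_KEY=... CONFLUENT_SECRET=... CONFLUENT_SR_KEY=... CONFLUENT_SR_SECRET=...

# Render the template with those values.
python3 -c 'import os, sys, jinja2; sys.stdout.write(jinja2.Template(sys.stdin.read()).render(**os.environ))' \
  < osquery_table.json.j2 > osquery_table.json
```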

Instead of going to the Apache Pinot dashboard, we will configure our tables using the pinot-admin CLI tool. You can install the CLI with Homebrew.
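
The Homebrew formula installs the pinot-admin launcher alongside the Pinot binaries:

```bash
brew install pinot
```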

Obtain the Apache Pinot controller host name and port from the Pinot console home page.

Run the command below, supplying the controller host and port, once for each of the Pinot table configurations you generated.
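
The pinot-admin AddTable command registers the schema and creates the table in one call; the file names here are the ones produced in the earlier sketches.

```bash
# Create the Realtime table and register its schema with the controller.
# Repeat with the suspicious-osquery schema and table config files.
pinot-admin AddTable \
  -schemaFile osquery_schema.json \
  -tableConfigFile osquery_table.json \
  -controllerHost "$PINOT_CONTROLLER_HOST" \
  -controllerPort "$PINOT_CONTROLLER_PORT" \
  -exec
```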

This should generate two Realtime tables in Pinot as well as add two schemas. 

Next, click “Query Console” in the left navigation bar and select the osquery table to view the osquery logs.
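
For example, a quick sanity check in the Query Console (column names follow the assumed schema above):

```sql
-- Show the ten most recent cleansed osquery events.
SELECT hostname, name, path, unixTime
FROM osquery
ORDER BY unixTime DESC
LIMIT 10
```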

Configuring Apache Superset

To configure Apache Superset to read from Apache Pinot, create a new database and select Apache Pinot from the drop-down.

Enter the SQLAlchemy URI using this format:

pinot://pinot-broker:8099/query/sql?controller=http://pinot-controller:9000/

Next, add a dataset, choosing the Pinot database, schema, and table; the drop-downs will prepopulate.

From here, you can create a chart and a dashboard. 

You can also increase the dashboard’s refresh rate to watch realtime osquery logs appear.

Summary

In this post, we exported osquery logs and persisted them to Apache Pinot, a real-time OLAP (RTOLAP) database. Decodable pre-processed the logs into a flat format and cleansed them, reducing the workload for Apache Pinot so that it can focus on serving real-time data with low latency. If you have any questions or would like help implementing this solution, please contact us at support@decodable.co.

Video Demo

Watch a video of this demo:


You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires. 

Learn more:

Join the community Slack

Demo Day - Connect Kafka to S3

Eric Sammer and Tim James demonstrate how Decodable connects Kafka to S3 by way of Decodable's own dogfooding - using it for internal metrics - in the context of how Decodable connects to a range of systems including Redis, AWS Kinesis, Pulsar, RedPanda, RedShift, Snowflake, Snowpipe, Apache Pinot/StarTree, and more.

Learn more

Demo Day: Confluent Cloud with Airline data

In this week's video, Arlo Purcell shows how easy it is to connect to Confluent Cloud as both a sink and a source, with auto-detection and import of a complex schema on both ends. In this demo he transforms airline industry standard SSIM XML data representing flight schedules.

Learn more

Ingesting Covid Data Into Apache Druid

Apache Druid is a popular realtime online analytical processing (RTOLAP) database. In this blog we'll show how to use Decodable to ingest COVID-19 global statistics into Apache Druid for visualization in a dashboard.

Learn more

Routing OSQuery Events via Apache Pulsar

OSQuery is an open source tool that lets you query operating system events using SQL. The events can be fed into a streaming platform, in this case Pulsar, for subsequent transformation and routing on the stream using Decodable.

Learn more

Tags
Demo
Development
pinot

Start using Decodable today.