
Ingesting OSQuery Logs To Apache Pinot

Hubert Dulay
Decodable

Previously, we demonstrated how to use an osquery extension to send event logs to Apache Pulsar. We then routed, cleansed, and filtered the logs with Decodable using only SQL. In this blog we will feed the cleansed logs into Apache Pinot, a real-time OLAP database. 

Apache Pinot is a real-time online analytical processing database (Real-Time OLAP, or RTOLAP). RTOLAP databases can quickly retrieve, aggregate, and analyze data without running heavy batch processing. Pinot integrates natively with streaming platforms such as Kafka, RedPanda, and Pulsar. We’ll be covering additional RTOLAP databases in future blog posts.

RTOLAP databases operate faster with pre-processed datasets that reduce their workloads and enable more focused query execution. Decodable is the ideal solution for performing this pre-processing.

Solution Architecture

The diagram above shows osquery capturing events on running processes and emitting them to Apache Pulsar. Decodable subscribes to a Pulsar topic and filters out noisy logs, cleansing the data before emitting it to the final destination. Details on how this works are in the previous blog post, “Routing OSQuery Events via Apache Pulsar.”

The final stage involves writing the cleansed osquery logs to Apache Pinot. We will then use Apache Superset (a fast and lightweight BI tool) to display the osquery logs stored in Apache Pinot.

Decodable lets you aggregate data from multiple streaming platforms, including Apache Pulsar, Apache Kafka, RedPanda, and AWS Kinesis, so you can assemble a single streaming data pipeline without needing to replicate your data to Kafka. The diagram below shows an enhanced version of this solution, with osquery logs from laptops going to Apache Pulsar while Docker containers send theirs to Apache Kafka.

Connect Kafka

Apache Pinot reads directly from streaming platforms to create a materialized view. Our demo will generate two sets of logs from Decodable: cleansed logs and cleansed suspicious logs. We’ll feed both sets of logs into Kafka and then to Apache Pinot.

Next, we’ll configure a Pinot table to read the two sets of logs from Kafka.

We start with the cleansed osquery logs. Before Pinot can read anything, we create the connections that feed Kafka: assuming we have a Kafka cluster ready, we need a Decodable sink that writes both sets of logs to Kafka.

Run the following command once for each set of logs, creating the osquery_cleansed_sink and suspicious_osquery_cleansed_sink connections. Change the CONNECTION_NAME, STREAM_NAME, and TOPIC for each run.
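
As a sketch, the sink creation might look like the following Decodable CLI invocation. The exact flags, property names, and value format are assumptions here; check `decodable connection create --help` against your CLI version.

```bash
# Hypothetical sketch: create a Decodable sink connection that drains a
# stream to a Kafka topic. Flag and property names are illustrative.
CONNECTION_NAME=osquery_cleansed_sink   # or: suspicious_osquery_cleansed_sink
STREAM_NAME=osquery_cleansed            # or: suspicious_osquery_cleansed
TOPIC=osquery-cleansed                  # or: suspicious-osquery-cleansed
STREAM_ID=<id>                          # look it up with: decodable stream list

decodable connection create \
  --name "$CONNECTION_NAME" \
  --connector kafka \
  --type sink \
  --stream-id "$STREAM_ID" \
  --prop bootstrap.servers="$BOOTSTRAP" \
  --prop topic="$TOPIC" \
  --prop value.format=avro
  # ...plus whatever security and schema-registry properties your cluster needs
```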

Configuring a Pinot Table

Next up, we’ll configure a Realtime table in Apache Pinot. Realtime tables ingest data from streams (such as Kafka) and build segments from the consumed data. Pinot has two other types of tables: Offline, for batch ingestion, and Hybrid, for both realtime and offline data. By default, all tables in Pinot are Hybrid in nature.

First, define the Pinot schema, derived from the schema in the sink connection. Both osquery feeds need a schema; the two schemas are identical except for their schemaName values.
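
A minimal sketch of such a schema file follows. The field names are assumptions based on typical osquery process-event columns; adapt them to the fields your cleansed stream actually carries.

```bash
# Sketch of a Pinot schema, saved to a file for use with pinot-admin later.
# Field names are assumed; match them to your cleansed osquery stream.
cat > osquery_schema.json <<'EOF'
{
  "schemaName": "osquery",
  "dimensionFieldSpecs": [
    { "name": "hostname", "dataType": "STRING" },
    { "name": "name",     "dataType": "STRING" },
    { "name": "action",   "dataType": "STRING" },
    { "name": "path",     "dataType": "STRING" },
    { "name": "pid",      "dataType": "LONG"   }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "unixTime",
      "dataType": "LONG",
      "format": "1:SECONDS:EPOCH",
      "granularity": "1:SECONDS"
    }
  ]
}
EOF
```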

Below is a Jinja template that generates a Realtime table configuration for Pinot (a sketch follows the parameter list). The template contains parameters that need to be set first.

Parameters:

  • {{SCHEMA}} - the Apache Pinot schemaName.
  • {{TOPIC}} - the topic Apache Pinot will consume from.
  • {{BOOTSTRAP}} - the Kafka bootstrap servers.
  • {{SCHEMA_REGISTRY}} - the schema registry URL.
  • {{CONFLUENT_KEY}} - the Confluent Cloud key to your Kafka cluster.
  • {{CONFLUENT_SECRET}} - the Confluent Cloud secret.
  • {{CONFLUENT_SR_KEY}} - the Confluent Cloud key for schema registry.
  • {{CONFLUENT_SR_SECRET}} - the Confluent Cloud secret for schema registry.
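
Here is a sketch of what the template might contain, saved as a file so it can be rendered in the next step. The stream properties follow Pinot's Kafka consumer and Confluent Schema Registry decoder configuration; the time column and flush threshold are assumptions to adjust for your setup.

```bash
# Hypothetical osquery_table.json.j2 - a Realtime table config template.
cat > osquery_table.json.j2 <<'EOF'
{
  "tableName": "{{SCHEMA}}",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "{{SCHEMA}}",
    "timeColumnName": "unixTime",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "{{TOPIC}}",
      "stream.kafka.broker.list": "{{BOOTSTRAP}}",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
      "stream.kafka.decoder.prop.schema.registry.rest.url": "{{SCHEMA_REGISTRY}}",
      "stream.kafka.decoder.prop.basic.auth.credentials.source": "USER_INFO",
      "stream.kafka.decoder.prop.schema.registry.basic.auth.user.info": "{{CONFLUENT_SR_KEY}}:{{CONFLUENT_SR_SECRET}}",
      "security.protocol": "SASL_SSL",
      "sasl.mechanism": "PLAIN",
      "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{{CONFLUENT_KEY}}\" password=\"{{CONFLUENT_SECRET}}\";",
      "realtime.segment.flush.threshold.time": "24h"
    }
  },
  "metadata": {}
}
EOF
```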

You can use Jinja to populate these parameters in the template and generate the Pinot table configuration. Create an environment variable for each of the Jinja parameters and run the command below twice, once for each topic (osquery logs and suspicious osquery logs), saving the output of each run to use in Apache Pinot.
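
One way to render the template, assuming the Jinja2 Python package is installed (`pip install jinja2`) and the parameter values shown are placeholders for your own:

```bash
# Set one environment variable per Jinja parameter (values are placeholders).
export SCHEMA=osquery TOPIC=osquery-cleansed
export BOOTSTRAP=... SCHEMA_REGISTRY=...
export CONFLUENT_KEY=... CONFLUENT_SECRET=... CONFLUENT_SR_KEY=... CONFLUENT_SR_SECRET=...

# Render the template with those values.
python3 -c 'import os, sys, jinja2; sys.stdout.write(jinja2.Template(sys.stdin.read()).render(**os.environ))' \
  < osquery_table.json.j2 > osquery_table.json
```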

Instead of going to the Apache Pinot dashboard, we will configure our tables using the pinot-admin CLI tool. You can install the CLI with Homebrew.
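
The Homebrew formula installs the pinot-admin launcher alongside the Pinot binaries:

```bash
brew install pinot
```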

Obtain the Apache Pinot controller host name and port from the Pinot console home page.

Run the command below, supplying the controller host and port, once for each of the Pinot table configurations you generated.
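
The pinot-admin AddTable command registers the schema and creates the table in one call; the file names here are the ones produced in the earlier sketches.

```bash
# Create the Realtime table and register its schema with the controller.
# Repeat with the suspicious-osquery schema and table config files.
pinot-admin AddTable \
  -schemaFile osquery_schema.json \
  -tableConfigFile osquery_table.json \
  -controllerHost "$PINOT_CONTROLLER_HOST" \
  -controllerPort "$PINOT_CONTROLLER_PORT" \
  -exec
```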

This should generate two Realtime tables in Pinot as well as add two schemas. 

Next, click “Query Console” in the left navigation bar and select the osquery table to view the osquery logs.
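
For example, a quick sanity check in the Query Console (column names follow the assumed schema above):

```sql
-- Show the ten most recent cleansed osquery events.
SELECT hostname, name, path, unixTime
FROM osquery
ORDER BY unixTime DESC
LIMIT 10
```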

Configuring Apache Superset

To configure Apache Superset to read from Apache Pinot, create a new database and select Apache Pinot from the drop-down.

Enter the SQLAlchemy URI using this format:

pinot://pinot-broker:8099/query/sql?controller=http://pinot-controller:9000/

Next, add a dataset, choosing the Pinot database, schema, and table; the drop-downs will prepopulate.

From here, you can create a chart and a dashboard. 

You can also increase the dashboard’s refresh rate to watch realtime osquery logs appear.

Summary

In this post, we exported osquery logs and persisted them to Apache Pinot, a real-time OLAP (RTOLAP) database. Decodable pre-processed the logs into a flat format and cleansed them, reducing the workload for Apache Pinot so that it can focus on serving real-time data with low latency. If you have any questions or would like help implementing this solution, please contact us at support@decodable.co.

Video Demo

Watch a video of this demo:


You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires. 

Learn more:

Join the community Slack

Demo Day - Connect Kafka to S3

Eric Sammer and Tim James demonstrate how Decodable connects Kafka to S3 by way of Decodable's own dogfooding - using it for internal metrics - in the context of how Decodable connects to a range of systems including Redis, AWS Kinesis, Pulsar, RedPanda, RedShift, Snowflake, Snowpipe, Apache Pinot/StarTree, and more.

Learn more

Demo Day: Confluent Cloud with Airline data

In this week's video, Arlo Purcell shows how easy it is to connect to Confluent Cloud as both a sink and a source, with auto-detection and import of a complex schema on both ends. In this demo he transforms airline industry standard SSIM XML data representing flight schedules.

Learn more

Ingesting Covid Data Into Apache Druid

Apache Druid is a popular realtime online analytical processing (RTOLAP) database. In this blog we'll show how to use Decodable to ingest COVID-19 global statistics into Apache Druid for visualization in a dashboard.

Learn more

Routing OSQuery Events via Apache Pulsar

OSQuery is an open source tool that lets you query operating system events using SQL. The events can be fed into a streaming platform, in this case Pulsar, for subsequent transformation and routing on the stream using Decodable.

Learn more

Tags
Demo
Development
pinot

Start using Decodable today.