Clickstream data is collected from user activity on the web, and is used to provide insight into how visitors get to the website, what they do once there, how long they stay on any given page, the number of page visits visitors make, and the number of unique and repeat visitors. Clickstream analytics have ability to refine data by processing, cleaning, and transforming the raw data into convenient structure that make analysis of data easy and more accurate. Using web data, businesses can not only identify customer needs but can offer customized solutions to cater to the needs of an evolving customer base. The global clickstream analytics market size was valued at $868.8 million in 2018, and is projected to reach $2.5 billion by 2026, indicating a significant focus for businesses.
In this example, we’ll walk through how the Decodable data service is used to clean, transform, and enrich real-time clickstream data. The processed data can then be used to inform business decisions.
Below we can see a sample of raw clickstream data, with one record per page visit for every user of each website monitored. Currently, it is not in the best form for analyzing how well the website is performing. For this, it would be better to have statistics aggregated over time. By using one or more Decodablepipelines, which are streaming SQL queries that process data, we can transform the raw data into a form that is best suited for how it will be consumed.
For this example, only a single pipeline is needed to process the raw incoming data into the desired form. But it is also possible to use multiple pipelines in a series of stages, with the output of each one being used as the input for the next. Depending on the complexity of the desired processing, it can be helpful to break it down into smaller, more manageable steps. This results in pipelines that are easier to test and maintain. Each stage in the sequence of pipelines is used to bring the data closer to its final desired form using SQL queries.
Decodable uses SQL to process data that should feel familiar to anyone who has used relational database systems. The primary differences you’ll notice are that:
- You activate a pipeline to start it, and deactivate a pipeline to stop it
- All pipeline queries specify a source and a sink
- Certain operations, notably JOINs and aggregations, must include windows
Unlike relational databases, all pipelines write their results into an output data stream (or sink). As a result, all pipelines are a single statement in the form INSERT INTO <sink> SELECT ... FROM <source>, where sink and source are streams you’ve defined.
Aggregate And Enrich Data Stream
For this example, the pipeline leverages the SQL tumble group window function to create a set of records across a non-overlapping, continuous window with a fixed duration of 1 hour. For each interval, the number of pages visited, the total amount of time spent reading or interacting with these pages, and an average of how much of the pages were actually viewed is calculated, grouped by website and user.
As an alternative, the hop window function could be used to create a set of records across a fixed duration that hops (or slides) by a given interval. If the hop interval is smaller than the window duration, the hopping windows overlap, and records from the data stream are assigned to multiple windows. Then a subsequent pipeline could be used to filter the results to one representing the highest level of engagement over a set duration for each user for each website.
Pipeline: Aggregate Clickstream Data
After creating a new pipeline and entering the SQL query, clicking the Run Preview button will verify its syntax and then fire up a new executable environment to process the next 10 records coming in from the source stream and display the results. Decodable handles all the heavy lifting on the backend, allowing you to focus on working directly with your data streams to ensure that you are getting the results you need.
At this point, a sink connection (one that writes a stream to an external system, such as AWS S3, Kafka, Kinesis, Postgres, Pulsar, or Redpanda) can be created to allow the results to be consumed by your own applications and services.
As we can see from this example, a sophisticated business problem can be addressed in a very straight-forward way using Decodable pipelines. It is not necessary to create docker containers, there is no SQL server infrastructure to set up or maintain, all that is needed is a working familiarity with creating the SQL queries themselves.
You can watch demonstrations of several examples on the Decodable YouTube channel.
Additional documentation for all of Decodable’s services is available here.
Please consider joining us on our community Slack.