Securing online applications and services is a major requirement for businesses of all types, and threat actors are constantly increasing the sophistication of their attacks. One example of this is mobile device emulators being used to spoof devices and mimic user behavior in an attempt to take over legitimate user accounts. With telemetry data such as accelerator and gyroscope signals, it is possible to train machine learning models to identify fraudulent activity and detect automated bots.
In this example, we’ll walk through how the Decodable data service is used to clean, transform, enrich, and aggregate real-time telemetry data describing a device’s accelerometer and touch screen interactions which is being sent from the Moonsense SDK. The processed data can then be sent onward to a fraud detection model for training or evaluation.
Below we can see a sample of the raw telemetry data from a typical user’s device. In its current form, it is not suitable for use by a machine learning model. By using one or more Decodable pipelines, which are streaming SQL queries that process data, we can transform the raw data into a form that is best suited for how it will be used.
For this example, four separate pipelines are used in series, with the output of each one being used as the input for the next. While it is possible to perform all the desired processing in a single large, complex pipeline, it is most often desirable to split them into smaller, more manageable processing steps. This results in pipelines that are easier to test and maintain. Each stage in the sequence of pipelines is used to bring the data closer to its final desired form using SQL queries.
Decodable uses SQL to process data that should feel familiar to anyone who has used relational database systems. The primary differences you’ll notice are that:
- You activate a pipeline to start it, and deactivate a pipeline to stop it
- All pipeline queries specify a source and a sink
- Certain operations, notably JOINs and aggregations, must include windows
Unlike relational databases, all pipelines write their results into an output data stream (or sink). As a result, all pipelines are a single statement in the form INSERT INTO <sink> SELECT ... FROM <source>, where sink and source are streams you’ve defined.
Clean The Input Data Stream
As with most data services pipelines, the first step is to apply a variety of transformations to clean up and simplify the input data. For this example, the first pipeline is used to parse and restructure the raw data as follows:
- the time fields will be converted from strings of integers representing epoch milliseconds totimestamp fields, which will enable more sophisticated processing in subsequent pipelines
- several fields that are nested inside the complex JSON source object will be elevated to simple top-level fields, which can then be more easily accessed in subsequent pipelines
- only the fields required by subsequent pipelines will be included in the output stream, filtering out extraneous fields and simplifying the data to be processed
Pipeline: Parse Moonsense Data
After creating a new pipeline and copying in the SQL query, clicking the Run Preview button will verify its syntax and then fire up a new executable environment to process the next 10 records coming in from the source stream and display the results. Decodable handles all the heavy lifting on the backend, allowing you to focus on working directly with your data streams to ensure that you are getting the results you need.
Unnest Data Stream Array
To help detect non-human bot activity, the raw pointer data from the parsed telemetry data stream can be analyzed. In order to facilitate that, the pointer_data field, which contains an array of pointer positions, needs to be unnested (or demultiplexed) into multiple records. To accomplish this, a cross join is performed between the moonsense_parsed data stream and the results of using the unnest function on the pointer_data field.
For example, if a given input record contains an array of 5 pointer positions, this pipeline will transform each input record into 5 separate output records for processing by subsequent pipelines.
When the pipeline is running, the effects of unnesting the input records can be seen in the Overview tab which shows real-time data flow statistics. The input metrics will show a given number of records per second, while the output metrics will show a higher number based on how many elements are in the pointer_data array.
Pipeline: Extract Pointer Records
Enrich data stream
In the next stage of pipeline processing, we want to determine how quickly the pointer’s position is changing. Because SQL provides a comprehensive set of powerful functions, such as power and sqrt, we can leverage these to enrich the data stream with the results of calculations that are more useful for subsequent processing.
This somewhat complex SQL query could be broken down into multiple smaller, simpler queries; but it is also possible to create pipelines of arbitrary complexity based on your requirements.
An inner nested select query is used to combine the change in pointer position data from the previous record with the current record using the lag window function, which provides access to a record at a specified physical offset which comes before the current record (in this case that is simply the previous record). A surrounding select query calculates the change in time and position between two consecutive pointer position records. Finally, the outermost select query calculates the pointer velocity and outputs that into a new data stream for processing by the next pipeline.
Pipeline: Calculate Pointer Velocity
Aggregate Data Stream
In this final pipeline stage, the data is aggregated into summary statistics that can then be fed into a detection model for training or evaluation. By leveraging the SQL tumble group window function, a data distribution matrix is created across a non-overlapping, continuous window with a fixed duration of 10 seconds. For each set of records, the number of pointer updates and totals for the interval are calculated.
For this example, we have focused only on the pointer position, but the original data stream contains a wealth of additional information, all of which can be processed in a similar manner.
Pipeline: Calculate Pointer Statistics
Clicking the Run Preview button will begin the 10-second tumble interval and then display the output data stream of this final step of the multi-stage pipeline for this example, as shown below. At this point, a sink connection (one that writes a stream to an external system, such as AWS S3, Kafka, Kinesis, Postgres, Pulsar, or Redpanda) can be created to allow the results to be consumed by your own applications and services.
As we can see from this example, a sophisticated business problem can be addressed in a very straight-forward way using Decodable pipelines. It is not necessary to create docker containers, there is no SQL server infrastructure to set up or maintain, all that is needed is a working familiarity with creating the SQL queries themselves.
You can watch a demonstration of this example on the Decodable YouTube channel.
Additional documentation for all of Decodable’s services is available here.
Please consider joining us on our community Slack.