In the transportation sector, it is imperative to know which assets are where, when they arrive or depart from points along their route, and whether or not they are proceeding on schedule. This is especially true for airlines, where customers expect accurate, real-time updates on flight status at every stage of their journey. Being able to extract key insights from complex data streams helps airlines keep their operations flowing smoothly and their passengers well-informed.
In this example, we’ll walk through how the Decodable data service is used to clean, transform, and enrich real-time flight data. The processed data can then be used to update airline websites and status boards in airport terminals.
Below we can see a sample of raw flight event data. In its current form, it is far more complex and detailed than what passengers want to know about their flights and what is needed to update a flight status board in an airport terminal. By using one or more Decodable pipelines, which are streaming SQL queries that process data, we can transform the raw data into a form that is best suited for how it will be consumed.
For this example, only a single pipeline is needed to process the raw incoming data into the desired form. But it is also possible to use multiple pipelines in a series of stages, with the output of each one being used as the input for the next. Depending on the complexity of the desired processing, it can be helpful to break it down into smaller, more manageable steps. This results in pipelines that are easier to test and maintain. Each stage in the sequence of pipelines is used to bring the data closer to its final desired form using SQL queries.
Decodable uses SQL to process data that should feel familiar to anyone who has used relational database systems. The primary differences you’ll notice are that:
- You activate a pipeline to start it, and deactivate a pipeline to stop it
- All pipeline queries specify a source and a sink
- Certain operations, notably JOINs and aggregations, must include windows
Unlike relational databases, all pipelines write their results into an output data stream (or sink). As a result, all pipelines are a single statement in the form INSERT INTO <sink> SELECT ... FROM <source>, where sink and source are streams you’ve defined.
Transform And Enrich Data Stream
For this example, each record of the data stream contains a single flights field, which in turn contains an array of flight data that needs to be unnested (or demultiplexed) into multiple records. To accomplish this, a cross join is performed between the flights-raw data stream and the results of using the unnestfunction on the flights field.
For example, if a given input record contains an array of 5 flights, this pipeline will transform each input record into 5 separate output records for processing by subsequent pipelines.
When the pipeline is running, the effects of unnesting the input records can be seen in the Overview tab which shows real-time data flow statistics. The input metrics will show a given number of records per second, while the output metrics will show a higher number based on how many elements are in the flights array.
With each flight represented by its own record, additional transformations can be performed to clean the input data. For this example, the flight.live.updated field will be converted from a string to atimestamp, which enables more sophisticated processing, such as using SQL group window functions.
The flight status can be further enriched by calculating flight_status a field based on the departure and arrival delay fields, and a flight_stage field based on whether the flight is on the ground and has yet to arrive at its destination.
Pipeline: Extract Flight Data
After creating a new pipeline and entering the SQL query, clicking the Run Preview button will verify its syntax and then fire up a new executable environment to process the next 10 records coming in from the source stream and display the results. Decodable handles all the heavy lifting on the backend, allowing you to focus on working directly with your data streams to ensure that you are getting the results you need.
At this point, a sink connection (one that writes a stream to an external system, such as AWS S3, Kafka, Kinesis, Postgres, Pulsar, or Redpanda) can be created to allow the results to be consumed by your own applications and services.
As we can see from this example, a sophisticated business problem can be addressed in a very straight-forward way using Decodable pipelines. It is not necessary to create docker containers, there is no SQL server infrastructure to set up or maintain, all that is needed is a working familiarity with creating the SQL queries themselves.
You can watch a demonstration of this example on the Decodable YouTube channel.
Additional documentation for all of Decodable’s services is available here.
Please consider joining us on our community Slack.