Many computing concepts from the 1970’s have fallen out of currency, but not so with ETL (Extract-Transform-Load), and its recent anagram shuffle ELT which manipulates data at destination vs in flight. ETL and ELT are traditionally scheduled batch operations, but as the need for always-on, always-current data services becomes the norm, realtime ELT operating on streams of data is the goal of many organizations - if not the reality, yet.
In real world usage, the ‘T’ in ETL represents a wide range of patterns assembled from primitive operations. In this blog we’ll explore these operations and see examples of how they’re implemented as SQL statements.
Transformations using SQL statements?
Yes! SQL combines the power & conciseness of a declarative language with ubiquity of skills among anyone that’s worked with code or data. Unlike almost any programming language you might use as an alternative, SQL’s has ubiquity thanks to the almost 50 years longevity - pretty much everyone in the computing industry has used it at some point. The power and pervasive nature of SQL means that it’s in use everywhere, even at companies building the latest developer technologies and services. SQL becomes even more powerful when enhanced by functions - which we’ll cover in a future blog post.
Most ETL pipelines fit one or a combination of patterns. Decodable’s Connection - Stream - Pipeline abstraction means you can choose to build everything into a single pipeline, or decompose sophisticated transformations into a network of reusable pipelines connected by streams, spanning teams, regions and use cases as needed.
Filters remove unwanted records from a stream, dropping records that don’t match the ‘rules’ in the SQL where clause. A filter is often used to suppress sensitive records for compliance, or to reduce processing load or storage needs on a target system.
The Route pattern creates multiple output streams from one or more input streams, directing records to their correct destination based on a set of rules. This pattern actually consists of multiple filters which all see every input record but each filter only transmits those records that match the rule for that particular destination.
Transforming pipelines create an output record by modifying the input record. Typically this will result in a 1:1 transmission, but in some cases the output is derived from more than one input record, so there could be a 1:many relationship. Here we’ll call out three specialized transformations:
Parse the incoming record, extract data from the input record and use it as the basis for enriching the derived output record.
Incoming data records often need to be normalized against a schema for a target system to process them. Missing fields may need default values populated, optional fields may need to be stripped out, and data types enforced.
Anonymizing pipelines simply eliminate sensitive fields for compliance, regulatory, or privacy reasons where the target system doesn’t need the information to complete processing.
Aggregation pipelines typically use the SQL window function to group incoming records into buckets - typically based on time - on which to perform the aggregation operation. Count, Min, Max, Avg, Sum are the typical operators, but there are many more.
Our final pattern is the trigger. Unlike almost all of the other patterns, the trigger output record probably has little overlap with the input record’s schema as it indicates that a set of conditions has been detected over one or more input records, and outputs an alert as a result. The output schema could represent the detected condition(s), action to be taken or both.
Why batch when you can stream?
We’ve only skimmed the surface of what’s possible building streaming ETL solutions with SQL in Decodable. Justifying batch transformations in your next data project just got that much harder!
You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.