Schema-on-write is a feature many engineers have used to their advantage when building data pipelines. In this blog, we will introduce this idea, and discuss how it fits into tools like dbt’s and Decodable’s opinionated approach to building streaming data pipelines.
Schema On Write
Traditional ETL (extract, transform, load) systems moved data between enterprise systems. They were composed of an application and sometimes a database or data warehouse. The application connected to systems and routed data. They were stateless and didn’t perform any complex transformations to the data. If complex stateful transformations like aggregations (count, sum, avg, etc.) or joins were required, it was deferred to a database or warehouse because transformation is easier there.
Also since SQL is the primary language in a database and is simple to use and understand, many technical and non-technical roles were able to use it to ask questions about their data. Talent is readily available with SQL, which in turn fuels knowledge creation.
A side effect with doing transformations in the database is it enforces schemas when data is written to a table. We call this “Schema-On-Write” where the format and data types of the data are verified before it was allowed to be written to the table. It forces your data to be structured and have relational integrity (i.e. references to other data are valid).
With the advent of big data and massive parallel processing (MPP) like Hadoop and Map-Reduce, we gained the ability to perform stateful transformations on data in the ETL application and before it got to the data warehouse. But it eased the schema-on-write restrictions that the data warehouse provided because the data entered these big data systems unstructured. The idea of “Schema-On-Read” was introduced to allow unstructured data to enter the data lake and be processed with MPP frameworks to apply structure to the data before it reached the data warehouse.
Apache Hive made this entire process a lot easier by abstracting map-reduce to SQL. It allowed engineers to run complex stateful transformations in the data lake on big data without the need of a relational database. But there was still an issue with schema-on-read. It deferred the enforcement of structure downstream to other processes. These processes had to clean the data for it to conform to the schemas. When these schemas changed, all these processes that cleansed and processed the data broke.
The advantage of implementing a schema-on-write is that invalid data is handled nearer the source, protecting downstream processes from invalid data. The engineers that own the downstream processes are likely to be decoupled from the applications that produced the data and are not privy to the schema evolutions occurring. Schema-on-write forces schema evolution in the producing applications to evolve with interoperability and compatibility in mind.
Schema-on-write enforcement isn’t a feature found in many streaming systems like Flink or Kinesis. Decodable’s opinionated approach to building streaming data pipelines with schemas-on-write makes this feel natural. It merges the advantages of the data warehouse with the scale of big data. Schemas are enforced early with fail-fast and intuitive error handling, which creates a more harmonious data environment.
Domain Driven Design (DDD) is a practice for modeling software to match a domain with input from that domain's experts. There are many ways to define your domain models so that it can be consumed by modeling tools. For example some data modeling tools will generate code in the programming language of your choice or define tables in a database. In this example, we will be using JSON schema to define our simplified Employee domain model.
Decodable follows a unique opinionated approach to building streaming data pipelines where it treats schemas that represent your domain model as first class citizens. It is an approach that protects the greatest stakeholders of your data pipeline, your consumers.
In the command below, we create an event stream using the definition from the JSON schema in Decodable for the NameChangeRequest event in the diagram above.
The stream enforces schema-on-write, giving it database-like semantics. In relational databases, when inserting records into a table, the values must conform to the data types defined in the table. If it does not, the insert is rejected and connection is put into a failure state for data producers to investigate. Schema-on-write protects the consumers from handling errors when schemas evolve. This includes all downstream micro-services and databases.
Conversely, schema-on-read is applying a schema after the data has already been written to the data stream. The stream acts as a delineation between the producer and consumer. Everything before the data stream is the producer’s responsibility and everything after is the consumer’s.
In Decodable’s examples repository, we walk through this use case by allowing you to define streams based on your DDD outcomes and scenarios that you may encounter when building data pipelines that use your modes.
In this video I walk through the demo in the same way you can with a free Decodable account.
You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.
Join the community Slack