What are data products?
Data products are like produce in a grocery store. When shopping for produce, I want assurance that the product is clean and not moldy, and that it’s wrapped and protected from dirt or from kids' germy little fingers. I may look for the organic label so that I can trust the product and feel no anxiety about eating it or serving it to others. Data products are no different.
Put simply, data products are data with high quality and trustworthiness. They give consumers confidence that what they are consuming is correct, secure, and will not break their applications.
In AI/ML, data scientists want to use data products because they can be confident that the data will yield sound answers to the questions their analytics are trying to address. Trusting data is especially critical when those analytics inform medical decisions. Data products provide this trust.
Why data products?
When we data engineers think of ETL, we often think of building a batch data pipeline from source to sink with all the required transformations in between. It’s usually requested by data scientists or other line-of-business (LOB) stakeholders who just need your data. A lot of the time you’re not even told the end use case (or worse, the SLAs the use case requires), so you’re left to guess how often your Airflow DAG should run its incremental loads. But what if you could just build data products? You wouldn’t actually need to worry about the sink. You would just provide the self-service tools that give your data customers the ability to consume the data into their own domain.
And what if you provide these data products as real-time streaming data? You wouldn’t have to worry about building and scheduling your Airflow DAG. It would be a continuous real-time feed that would meet most required SLAs. The only part data engineers would be responsible for is sourcing the data and transforming it so that it meets the high quality and trust your data consumers require. You would only do this work once before publishing it for others to consume.
Decodable enables easy publishing of data products as streams. We also enable subscribers to easily consume and bring that streaming data into their own domain.
In the following use case (see GitHub repo), we’ll start by generating some mock input data. We’ll parse and transform it into a format generalized for consumers, and we’ll assign a schema to it. Once the data is parsed, cleaned, and formatted, we can consider it a data product.
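To make those steps concrete, here is a minimal Python sketch of the same flow: generate mock input, parse and clean it, and shape it to a schema. It is only an illustration, not the code from the repo or how a Decodable pipeline is written. The `Reading` schema, the raw field names (`dev`, `temp_f`, `ts`), and every function name are assumptions made up for this example.

```python
import json
import random
from dataclasses import dataclass, asdict
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class Reading:
    """Hypothetical schema for the published data product."""
    device_id: str
    temperature_c: float
    recorded_at: int  # epoch milliseconds


def mock_raw_events(count: int, seed: int = 7) -> Iterator[str]:
    """Yield raw JSON strings; every fifth record is deliberately corrupt."""
    rng = random.Random(seed)
    for i in range(count):
        if i % 5 == 4:
            yield "not-json"  # simulate a malformed record from the source
            continue
        yield json.dumps({
            "dev": f" Sensor-{rng.randint(1, 3)} ",  # untrimmed, mixed case
            "temp_f": str(round(rng.uniform(50, 100), 1)),  # number sent as text
            "ts": 1_700_000_000_000 + i * 1000,
        })


def to_reading(raw: str) -> Optional[Reading]:
    """Parse, clean, and convert one raw event; return None if it is unusable."""
    try:
        event = json.loads(raw)
        return Reading(
            device_id=event["dev"].strip().lower(),
            temperature_c=round((float(event["temp_f"]) - 32) * 5 / 9, 2),
            recorded_at=int(event["ts"]),
        )
    except (ValueError, KeyError, TypeError, AttributeError):
        return None


def build_data_product(raw_events: Iterator[str]) -> List[dict]:
    """Keep only the records that satisfy the schema."""
    return [asdict(r) for r in map(to_reading, raw_events) if r is not None]


if __name__ == "__main__":
    for record in build_data_product(mock_raw_events(5)):
        print(record)
```

Dropping bad records silently is a simplification. A real pipeline would more likely route them somewhere they can be inspected, so that consumers and producers can both see what was rejected.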
Streaming data products - as easy as REST APIs?
AsyncAPI is an open source initiative with the goal of making streaming architectures as easy and as common as REST APIs.
AsyncAPI provides a standard way of describing asynchronous (streaming) data, building on the approach OpenAPI takes to describing REST APIs. You can extend AsyncAPI to add self-service capabilities that enable easy integration and consumption of data products published in Decodable.
With AsyncAPI tools, developers can parse an AsyncAPI YAML document to generate client code, such as a Spring Boot application, or even HTML documentation. You can also parse an AsyncAPI document to call REST endpoints in Decodable that create sink connections, pulling data products into your domain. Paired with Decodable, these tools make for an easy, low- to no-code experience when consuming data products.
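As a rough illustration of the parsing step, the sketch below reads a small AsyncAPI 2.x document and pulls out each subscribable channel with its payload fields. The document is written as JSON (AsyncAPI allows JSON as well as YAML) so that only the Python standard library is needed. The channel name and fields are made up, and `sink_request` builds a purely hypothetical request body: it is not the shape of Decodable's REST API, so check the Decodable API documentation before calling any real endpoint.

```python
import json
from typing import Dict

# A minimal AsyncAPI 2.x document; channel and field names are illustrative.
ASYNCAPI_DOC = json.loads("""
{
  "asyncapi": "2.6.0",
  "info": {"title": "Sensor readings", "version": "1.0.0"},
  "channels": {
    "sensor_readings": {
      "subscribe": {
        "message": {
          "payload": {
            "type": "object",
            "properties": {
              "device_id": {"type": "string"},
              "temperature_c": {"type": "number"},
              "recorded_at": {"type": "integer"}
            }
          }
        }
      }
    }
  }
}
""")


def describe_channels(doc: dict) -> Dict[str, Dict[str, str]]:
    """Map each subscribable channel to {field name: JSON Schema type}."""
    result = {}
    for name, channel in doc.get("channels", {}).items():
        operation = channel.get("subscribe")
        if operation is None:
            continue  # skip channels that cannot be subscribed to
        properties = (
            operation.get("message", {}).get("payload", {}).get("properties", {})
        )
        result[name] = {
            field: spec.get("type", "unknown") for field, spec in properties.items()
        }
    return result


def sink_request(channel: str, fields: Dict[str, str], connector: str) -> dict:
    """Build a HYPOTHETICAL request body (not Decodable's actual API shape)."""
    return {
        "stream": channel,
        "connector": connector,
        "schema": [{"name": n, "type": t} for n, t in fields.items()],
    }


if __name__ == "__main__":
    for channel, fields in describe_channels(ASYNCAPI_DOC).items():
        print(json.dumps(sink_request(channel, fields, "postgres"), indent=2))
```

The point of the sketch is that everything a consumer needs (the stream name and its schema) is already in the AsyncAPI document, so wiring up a sink can be driven from that document instead of being coded by hand.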
To try this for yourself, check out the samples in the Decodable repository.
Watch me demo AsyncAPI with Decodable from our recent Demo Day:
You can get started with Decodable for free. Our developer account includes enough for you to build a useful pipeline and, unlike a trial, it never expires.