There’s been quite a bit of talk about data meshes recently, both in terms of philosophy and technology. Unfortunately, most of the writing on the subject is thick with buzzwords, aimed at VP- and C-level executives, and nearly unparsable to engineers. The motivation behind the data mesh, however, is not only sound but practical and intuitive.
Data Mesh In a Nutshell
Data, and its associated infrastructure and operations, are naturally distributed. In most cases, data is produced by a particular team and service, and consumed or used by one or more teams and services. The data produced tends to be highly specific to the source services: its format, structure, how and when it’s made available to external services, privileges required to access it - the list goes on and on. The team that owns the service producing this data is, of course, intimately familiar with all of these idiosyncrasies.
Meanwhile, services that need to consume this data - of which there may be many - have an entirely different set of requirements and idiosyncrasies with which to contend. These services are frequently managed by different teams with different domain and technical expertise, priorities, and motivations. The question is how to bridge these worlds.
The core issue is that we’ve historically organized teams around technologies rather than domain-specific knowledge of the data and its associated processing and operations. You’ll commonly find teams responsible for “the data warehouse” or “the data pipelines,” yet with little understanding of the data that lives within the warehouse or flows through the pipelines. Instead, knowledge about the data being produced lives with the teams producing it, and the details about how data needs to be used rightly live with the teams consuming it. By the time data gets to consumers, quality, lineage, usability, discoverability, and security have all degraded, and may even be lost entirely.
The fix for this proposed by the data mesh is to push ownership, responsibility, and infrastructure for data as close to the source as possible. The source team, then, makes this data available as a service to everyone else. By doing this, the people with the most knowledge about the data are in the right position to care for it. Practically, this means that a team that owns a “domain” or functional area is responsible for making its data available to those who need it.
If you squint, this looks identical to the (micro)service and API movement. Teams with the most domain expertise build a service with whatever tech makes the most sense in their world and provide an API - typically REST+JSON these days - to the outside world. Anyone with access to those APIs is just another customer. In many ways, the data mesh does exactly this for data rather than business logic. Just as with microservices, this distributes functionality over multiple teams and technologies rather than a single centralized team. All of the same advantages and disadvantages apply when doing this with data, but the emergent wisdom is that it’s worth it.
But what does it mean to “provide data as a service?” In all but the simplest cases, it’s not the same as exposing functionality from a microservice via a REST API. Data has a different set of standards for access, tends to be far larger, and can have more complex processing requirements than operational APIs.
The goals are to:
- Provide a standard way of serving and consuming data that works for all use cases
- Allow teams to self-serve: to provide or consume data without centralized support
- Work well with existing technologies, processes, and requirements
A Standard API for Data Products
In recent years, messaging has proven to be the best general-purpose “API” for data products. Modern messaging systems are low latency (1-10ms), scale through partitioning, provide clear guarantees on durability and delivery, and are easy to get as a service. Apache Kafka, Apache Pulsar, AWS Kinesis, GCP Pub/Sub, Azure EventHubs, and more are all in widespread use today. While messaging handles real-time applications and use cases, it’s straightforward to additionally write data to object stores, key/value stores, operational database systems, and analytical database and data warehouse systems. This ability to support both online and offline systems makes messaging the natural medium for serving data to customers. By handling the most demanding real-time workloads, we’ve solved the simple batch cases as well.
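To make the “scale through partitioning” claim concrete, here’s a toy sketch in plain Python (not the API of Kafka or any other messaging system): messages with the same key always route to the same partition, so per-key ordering is preserved while separate partitions can be produced to and consumed from in parallel.

```python
import hashlib

class PartitionedLog:
    """Toy partitioned message log: same key -> same partition, so per-key
    ordering holds while partitions scale out independently."""

    def __init__(self, num_partitions=4):
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Stable hash of the key picks the partition deterministically.
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def send(self, key, value):
        p = self._partition_for(key)
        self.partitions[p].append((key, value))
        return p

log = PartitionedLog()
p1 = log.send("user-42", {"event": "click"})
p2 = log.send("user-42", {"event": "purchase"})
assert p1 == p2  # same key always lands in the same partition, in order
```

Real systems add durability, replication, and consumer offsets on top, but the routing idea is the same.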
Just as with any good API, producers of data must clearly define the API’s contract. In this context, that contract must convey message format, schema, metadata, and semantic information to consumers. Details such as retention, partitioning, ordering, and messaging modes must also be defined as part of the public contract. Some streams may be private to a team, while others are public and meant to be shared. Like any good API, backward- and forward-compatibility are hot-button topics that are both complicated and necessary to address; something worthy of a dedicated post in the future.
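As a hedged illustration of what such a contract might contain (the field names and contract format here are invented for this example, not Decodable’s actual API), a producer can validate every message against the declared schema before publishing, so consumers can rely on the contract holding:

```python
# Hypothetical stream contract: a schema plus the operational guarantees
# (retention, partition key) the producer promises to consumers.
CONTRACT = {
    "schema": {"order_id": int, "amount_cents": int, "currency": str},
    "retention_days": 7,
    "partition_key": "order_id",
}

def validate(message, contract):
    """Reject messages that don't satisfy the declared schema."""
    schema = contract["schema"]
    missing = set(schema) - set(message)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field, expected in schema.items():
        if not isinstance(message[field], expected):
            raise TypeError(f"{field}: expected {expected.__name__}")
    return message

validate({"order_id": 1, "amount_cents": 499, "currency": "USD"}, CONTRACT)
```

In practice this role is played by schema registries and serialization formats with compatibility rules, but the principle is the same: the contract is checked at the producer, not discovered by the consumer.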
With Decodable, we provide this infrastructure for users in the form of streams, which are message queues with all of these concerns bundled together. Connections then attach streams to data source and destination systems, handling any necessary translation as you’d expect. Teams can build pipelines to filter, route, transform, enrich, aggregate, or otherwise slice and dice data between streams. It’s common to use pipelines to create curated streams for external teams by filtering non-essential data, fixing data quality issues, normalizing data such that it makes sense to those outside the team, and otherwise making the data ready for use.
When consumers of data require additional processing specific to them, they can build pipelines that further refine data for their specific needs. This is akin to an API client performing business logic after having received an API response. Similarly, this creates a clear separation of concerns between the producer (the API service) and the consumer (the API client). Producers of data ensure the API contract is maintained, while consumers customize and specialize data for their needs.
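The producer/consumer split described above can be sketched with two small stages (plain Python generators standing in for stream pipelines; the event fields are made up): the producer publishes a curated stream with internal fields stripped and bad records filtered, and a consumer further refines that stream for its own reporting needs.

```python
raw_events = [
    {"user": "a", "amount_cents": 1200, "debug_trace": "...", "valid": True},
    {"user": "b", "amount_cents": -1,   "debug_trace": "...", "valid": False},
]

def curate(events):
    """Producer-owned pipeline: drop internal fields, filter invalid records."""
    for e in events:
        if e["valid"]:
            yield {"user": e["user"], "amount_cents": e["amount_cents"]}

def to_dollars(events):
    """Consumer-owned pipeline: specialize the curated stream for reporting."""
    for e in events:
        yield {"user": e["user"], "amount": e["amount_cents"] / 100}

print(list(to_dollars(curate(raw_events))))  # -> [{'user': 'a', 'amount': 12.0}]
```

Note the separation of concerns: `curate` enforces the producer’s contract, while `to_dollars` is free to change without the producer ever knowing it exists.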
This simple set of primitives - connections, streams, and pipelines - work together to form the technical architecture and backbone of a data mesh.
Making Teams Self-Service
If the goal of the data mesh is to put teams in a position to control their destiny and success, self-service is critical. Rather than having intermediary teams in the middle - where context and urgency are often lost - producers of data create and serve their data directly to their customers. These customers, in turn, can tailor data to their needs without the game of telephone that so commonly occurs. Producers of data no longer have to build and maintain specialized infrastructure from which they see no value, and consumers don’t have to depend on someone else to get what they need. This parallels the trend of engineers being on the hook for deploying and running their own services.
The challenge is how to make all producers and consumers of data equally effective when they have different use cases, degrees of technical sophistication, tech stacks and tools, and time constraints.
Here’s what we’ve seen make teams successful:
Provide a single way of producing and consuming data. Having a standard way of making data available, discoverable, and usable by customers is critical. Decodable provides exactly this platform.
Make adoption dead simple. Teams shouldn’t have to deploy and manage infrastructure to serve or consume data. That infrastructure must be provided as a service, just as application developers have systems like Kubernetes that handle many of the nasty parts of deploying production applications. We designed Decodable to be as close to “serverless” as you can get.
Speak SQL. To meet customers where they are, you have to speak in terms they understand. Just about everyone knows basic SQL and, in most cases, that’s all that’s needed. Asking people to learn new languages or paradigms is a losing battle, outside of specific domains. SQL is data’s lingua franca. Our goal at Decodable is to bridge those specific domains and create accessibility for everyone, and that’s why we chose SQL as the way of building pipelines.
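To show why basic SQL is usually all that’s needed, here’s a sketch using Python’s built-in SQLite (the table and columns are invented for illustration; a streaming system would run this same kind of query continuously over a stream rather than once over a table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, amount_cents INTEGER, valid INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a", 1200, 1), ("a", 300, 1), ("b", -1, 0)],
)

# A typical "curation" pipeline expressed in SQL: filter bad rows,
# normalize units, and aggregate - no new language or paradigm required.
rows = conn.execute("""
    SELECT user_id, SUM(amount_cents) / 100.0 AS total_dollars
    FROM orders
    WHERE valid = 1
    GROUP BY user_id
""").fetchall()
print(rows)  # -> [('a', 15.0)]
```

The same filter/normalize/aggregate shape covers a large share of real pipeline logic, which is why SQL travels so well across teams.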
Be open. Any system that’s about bridging communities of users must be open and accessible to those communities. It sounds obvious, but if it’s hard to connect a system to the data mesh, people won’t do it, at which point you’re losing value. This is another reason we like real-time messaging systems: it’s easy (relatively speaking) to provide robust connectivity to real-time and batch systems. Rather than fighting with teams to get them to re-platform their applications on centralized infrastructure, give them the autonomy to pick what’s right for them, and give them a way to get the data they need. Our job at Decodable is to allow teams to get what they need, in concert with their preferred ecosystem.
The data mesh is a pragmatic response to the complexity of different teams sharing data in a safe, scalable, and repeatable way. We built Decodable to make it easy and fast to get teams on board and effective in building and consuming data products.
Create an account; join our Slack community of platform engineers, data engineers, and data scientists; or just check out our docs. Additionally, check out Zhamak Dehghani’s detailed description of the data mesh.