Stream processing is widely recognized for its ability to continuously and incrementally compute data in motion, delivering faster insights. But where does this technology find its place in your data platform? The short answer is that stream processing serves as the integral engine for data movement between data systems.
In this blog post, I will start with the current state of data movement within a typical data platform, highlighting scaling and management issues. Subsequently, I will illustrate how incorporating stream processing into the architecture enables lower latency, simpler code, and greater flexibility in data system management.
Current State of Data Movement
A robust data platform is the backbone of modern business, ensuring the smooth operation of online services and deriving insights from data. To meet the diverse needs of the business, various systems are used in tandem:
- Operational DBs: For service API requests
- Time-series DBs: For service metrics
- Cache: For quick access to frequently used data
- Full Text Search: For quickly and efficiently searching unstructured text
- Analytical DBs: For business intelligence and reporting
A typical architecture looks like this:
One observation is that the same piece of data is stored multiple times, once in each of these systems. For example, consider an online marketplace. Product information might be stored in Postgres and Redis to handle service requests. The product description might be stored in Elastic to handle full-text search. Parts of the product information might also be stored in Snowflake to support business analytics about products or sales.
All of those systems must be updated and kept in sync whenever a product is changed. Traditional approaches to updating multiple data systems involve synchronous communication and batch processing.
Operational data systems, which offer user-facing functionality like a transactional database and full-text search engine, are updated synchronously with sub-second latency. Analytical data systems, such as data warehouses, are typically updated offline through batch processes, like Spark or FiveTran, usually on an hourly interval. Some systems can reduce the interval to as low as 1 minute. However, in such cases the focus is on data replication and deferring data transformation to a later stage after the data is stored in the destination systems. This practice is known as ELT (Extract, Load, Transform). While using synchronous communication and batch processing to update data systems are easy to start with, there are some real challenges to be called out.
The application code synchronously updates every data system serving the application. The biggest issue is keeping all the systems consistent for which distributed transactions would be needed. The diagram below shows the inconsistency when distributed transactions are not used.
However, not all the data systems support distributed transactions. The application code needs to deal with partial failure, and the decision can vary case to case, leading to complex, bespoke code.
Synchronous communication can also lead to data loss if distributed transactions are not used properly. The diagram below shows that the search index update is lost. You never want to add a product to a system and not have it be searchable!
Another source of potential data loss arises from the use of at-most-once delivery semantics. Logs and metrics are such examples, designed to prevent the service from being blocked when the monitoring system is down. However, this approach can pose challenges, especially when critical information is missing during an incident investigation.
Any of the data systems can become the bottleneck for the service which then impacts its performance. A schema change in the source data requires coordinated adjustments in downstream systems, often involving additional code and downtime, making the process tedious and error-prone.
This issue is specifically about batch processing. Businesses keep pushing for ever lower latencies to serve more timely decision-making. However, batch processing’s periodic and sequential nature hinders further latency reduction.
In larger organizations, various data systems are often owned and managed by separate teams. For example, Personally Identifiable Information (PII) may be present in different data systems owned by Finance, Sales, and Engineering teams. When data movement occurs, such as transferring customer information from the Sales team's CRM system to the financial database, PII compliance is handled on an ad-hoc basis, lacking centralized visibility. Establishing accountability for data quality, security, and compliance becomes challenging across teams. Coordinating efforts to ensure seamless data flow and consistency across diverse platforms demands substantial resources.
In conclusion, data movement is ubiquitous among the various data systems. There are many bespoke ways to move the data that is tailored for the specific system. These solutions not only impact the performance, but also impose hard problems on data consistency, governance, and management.
Stream Processing: Unified Data Movement
Since having various data systems is unavoidable, it makes sense to have a unified way for data movement that is powerful to achieve low latency requirements and easy to apply data governance. How do we do that?
There are three components involved in the data movement: data sources, data transformations, and data consumers. Source data can be in the form of streaming data or data living in a data store. Data from a data store can be turned into a stream of change events using Change Data Capture (CDC). In addition, stream Therefore, in my opinion, data transformation should just work on streaming data. Here is the model:
- Source Connectors: Turn all data into a uniform type - streaming data, with data contracts such as schema, semantics, and SLAs. Enforce data governance here.
- Stream Processing: Continuously process streaming data (mapping, filtering, aggregation, joining, etc.) to meet the expectations of consuming data systems.
- Sink Connectors: Consume processed data streams and put them in the destination systems.
As you can see from the architecture above, data movement is essentially streaming and stream processing. Streaming enables the continuous flow of data, and stream processing processes the data instantaneously and incrementally as it arrives. There are many benefits of this approach.
The most obvious benefit is the reduction of latency since the system can continuously process data as it arrives. Event analytical data systems can be updated with much lower latency (sub-seconds) with transformation happening before the data is stored.
This architecture is very flexible. Adding a data system or adding a new view in a data system means adding another stream processing job and sink connector without changing the application code. Various data systems are eventually consistent because they consume the same events in the same order.
Schema migration can be managed in a unified way. You can first use another stream processing job to write back to the stream with the old schema, maintaining backward compatibility. Then, work on migrating each downstream system to use the new schema. This is the same as adding a new view in the data system - create a new stream processing job to consume the updated stream with the new schema and write to the destination system with a new view. Then you can maintain both views in parallel, run A/B testing and progressively shift clients to the new view. You can gradually phase out the old view, tearing down the old streams and jobs. This approach makes migration much easier to manage and less intimidating. You can also test each step without any impact on the existing system, ensuring zero downtime.
System downtime is isolated and easy to restore. This solves the issues related to synchronous communication across data systems discussed earlier. With stream processing, if one of the data systems is down, the rest of the system remains unaffected. Stream processing keeps track of its processing position, allowing it to resume from the last position once the problematic data system is back online. This ensures no data updates are lost, and consistency is quickly restored..
In a large organization, teams can expose their data through streams with well-defined schemas, accompanied by data governance and latency guarantees. Think of these streams as APIs for the data. On the consuming side, each team maintains their own stream processing jobs to put data into the systems they manage. This loosely coupled approach is good for organization scalability and management.
However, this architecture has its limitations. While the data is processed in real-time, it still incurs sub-second level latency. If lower latency is required, for example in high-frequency trading, then an alternative solution should be considered. The data systems are eventually consistent, and sometimes, this might not be sufficient for certain use cases. For instance, when revoking a user’s access, the aim is to prevent any window of access, and this might not be achieved if the cache is not yet invalidated.
Adopting a unified approach to data movement through streaming, combined with stream processing, offers numerous advantages to data platforms. The architecture not only provides a flexible framework for adding or modifying data systems but also simplifies schema migration without disrupting existing systems. Ultimately, this architecture fosters centralized data governance, encourages effective team collaboration, and enables more real-time decision-making.