Go to most data conferences, and you’ll find people talking about stream processing. In this article, we’ll explore why stream processing is getting so much attention, and what platform architects and engineers need to know to navigate the streaming stack. We’ll start with the “why” behind its rise, and then delve into the high-level architecture that forms the backbone of stream processing.
Reasons for the Adoption of Streaming Technology
Data can have immense value when analyzed immediately. New use-cases and higher user expectations are pushing the limits of existing technology. A big driver of streaming adoption is user demand for new use-cases, combined with the expectation of low-latency, real-time experiences. One example of a new use-case is recommender systems, which need to adjust recommendations quickly for new users in order to keep them on the platform. Others are ride-sharing and food delivery services, where customers increasingly expect real-time visibility into their orders.
Besides these user-facing requirements, internal users ask for real-time analytics (Pinterest: Real-time experiment analytics). We are also seeing more and more cases where stream processing is used for automated actions based on events, or even powering entire social networks by computing their state with a streaming application.
Reasons For the Rise of Stream Processing Technology
Cheap real-time data analysis for everyone. Stream processing enables the immediate analysis of large amounts of data on commodity hardware. Open source technologies that are capable of large-scale, immediate data processing are now available, making stream processing accessible to nearly all organizations.
Prior to modern stream processing, you had to make a choice: opt for small-scale, immediate analysis somewhere in your data stack, or settle for large-scale analysis in your warehouse, with long delays. Another option was spending a lot of money on the development of bespoke software for large-scale, real-time data analysis, as seen in high-frequency trading or credit card fraud detection.
Data is produced in real time; why not process it immediately? If you could understand what your business is doing right now, why wouldn’t you? The constraints that batch processing forces on data availability are purely artificial ones, brought about by technical limitations of software and hardware in the past.
Easy-to-use APIs and well-understood patterns enable stream processing for everyone. In particular, with the availability of streaming SQL, wider audiences can build data pipelines easily. Besides declarative SQL, programming language support is expanding beyond JVM-based languages to Python and others.
The Anatomy of Stream Processing
Stream processing can be looked at as having three distinct stages of handling data: gathering it, processing it, and presenting the results.
Let’s have a look at each of these in more detail, and consider some examples of each.
Step 1: Gather Data
In order to process a stream of data, you first need that stream of data! Luckily, almost all data is produced in a continuous fashion; persisting it to files is just an artifact of legacy processing systems.
Here are some common sources of real-time data:
- Databases: A popular approach is to read the transaction log of your transactional database to create a stream of change events. This technique is known as Change Data Capture (CDC), and can be done with a tool such as Debezium. All existing data in your database, as well as any subsequent updates, will be reflected in real time in your stream processor.
- Application Logs: Use tools such as Vector, Logstash, or the Kafka Appender for Log4j to send your real-time data to Apache Kafka.
- Application events: Use Kafka producers in your application (such as in Spring Boot) to send custom events, such as “driver location update”, “order processed” or “lightbulb updated.”
- Machine data: Collected from a fleet of servers or IoT devices.
- Flat files: If you cannot get your data in real time and a legacy system is still writing files, you can still ingest those files and stream their changes, for example using Apache Flink’s File Source.
Of course, if your data is already in a data stream, such as Kafka, MQTT, ActiveMQ, or other similar technologies, you can just start processing it from there.
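To make the “application events” source concrete, here is a minimal sketch in plain Python. The event names and fields are illustrative assumptions, and a simple generator stands in for the application code; in production, each JSON-serialized record would be sent to a Kafka topic by a Kafka producer rather than collected into a list.

```python
import json
import time

def application_events():
    """Stand-in event source: in a real system, the application would
    publish these records to a Kafka topic via a producer client."""
    yield {"type": "order_processed", "order_id": 1, "ts": time.time()}
    yield {"type": "driver_location_update", "driver_id": 7,
           "lat": 52.52, "lon": 13.40, "ts": time.time()}

# Serialize each event as JSON, as you typically would before
# handing it to a message broker.
serialized = [json.dumps(event) for event in application_events()]
```

The key idea is that events are emitted the moment they happen, one record at a time, rather than accumulated into files for later batch loads.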
Step 2: Process Data
Open-source stream processors such as Apache Flink or Kafka Streams allow you to analyze data in real time.
There are a number of processing operations which are very common in stream processing use-cases:
- Deduplication of events from a stream.
- Joining of multiple streams, for example matching events with the same key in a time-window, running a continuous real-time join or doing a lookup join on data from a replicated database.
- Aggregations, either running forever (“total count by country”) or in time windows (“total count by country per hour”).
Here’s an example of how these operations might be combined: enriching a real-time stream of order information coming from Kafka with customer reference (‘dimension’) data from a database, to load a denormalized view of the combined customer and order data into Elasticsearch.
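A minimal sketch of that enrichment pattern, with illustrative field names and in-memory stand-ins: the `customers` dict plays the role of dimension data replicated from the database (e.g. via CDC), and the output list stands in for the documents you would index into Elasticsearch.

```python
# Customer reference ('dimension') data, as replicated from a database.
customers = {
    101: {"name": "Alice", "country": "DE"},
    102: {"name": "Bob", "country": "US"},
}

# Order events as they might arrive from a Kafka topic.
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 42.0},
    {"order_id": 2, "customer_id": 102, "amount": 7.5},
]

def enrich(order_stream, dimension):
    """Lookup join: attach customer fields to each order, producing
    the denormalized documents to index downstream."""
    for order in order_stream:
        customer = dimension.get(order["customer_id"], {})
        yield {**order,
               **{f"customer_{key}": value for key, value in customer.items()}}

documents = list(enrich(orders, customers))
```

In a real pipeline the dimension data changes over time, so the stream processor keeps it as continuously updated state rather than a static dict.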
There are many other ways of processing data in a stream, including:
- Filtering and tuple-at-a-time transformations are used for getting data into the desired shape.
- Database or REST endpoint lookups – e.g., for each event in a stream, we call an external system to enrich the event.
- Complex Event Processing (CEP) for finding patterns in streaming events.
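To give a flavor of CEP, here is a toy pattern detector in plain Python: flag any user who produces three consecutive failed logins. The event shape and threshold are illustrative assumptions; real CEP libraries such as FlinkCEP let you express such patterns declaratively and handle event time, state, and timeouts for you.

```python
def find_failed_login_bursts(events, n=3):
    """Emit a user id whenever that user produces n consecutive
    'login_failed' events; any successful login resets their streak."""
    streak = {}
    matches = []
    for user, outcome in events:
        if outcome == "login_failed":
            streak[user] = streak.get(user, 0) + 1
            if streak[user] == n:
                matches.append(user)
                streak[user] = 0
        else:
            streak[user] = 0
    return matches

events = [
    ("u1", "login_failed"), ("u2", "login_ok"), ("u1", "login_failed"),
    ("u1", "login_failed"), ("u2", "login_failed"),
]
suspicious = find_failed_login_bursts(events)  # ["u1"]
```

Note that the streak is tracked per user, so interleaved events from other users don’t break a pattern in progress.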
Stream processors such as Flink usually provide multiple interfaces, including high-level abstractions such as SQL, as well as Java and Python libraries. In particular, the programming abstractions allow for mixing and matching operators, from high-level CEP operators down to very low-level primitives accessible directly from Java.
Step 3: Present Data
After collecting and processing your data, the final step is to make it available to downstream consumers.
Where the data goes depends on what you’re doing with it. Common examples include:
- Driving real-time applications: after processing the data (perhaps to cleanse it, or to look for patterns that should trigger automated actions), you’d write it to a streaming platform such as Kafka. From here your applications would consume the messages to drive further actions.
- Data warehousing/lakehousing and long-term storage in general: you can write streams of data to object stores including Amazon S3 (using file formats such as Parquet, or table formats like Apache Iceberg and Delta Lake), or relational databases and cloud data warehouses such as BigQuery, Snowflake, Oracle, etc.
- Real-time analytics: driven by technologies such as Apache Pinot, ClickHouse, and Apache Druid.
- Transactional databases: such as PostgreSQL or MySQL, from where the data is used in other applications.
- Specialized datastores: such as graph databases (Neo4j), time-series databases (InfluxDB), full-text search (Apache Lucene), etc.
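The fan-out across these destinations can be sketched as simple routing. The event kinds and sink names here are illustrative assumptions, and plain lists stand in for the downstream systems; in production, each sink would be a connector writing to Kafka, object storage, a warehouse, or a database.

```python
# In-memory stand-ins for downstream systems.
kafka_topic, warehouse_rows, analytics_rows = [], [], []

SINKS = {
    "alert": kafka_topic.append,        # drive real-time applications
    "archive": warehouse_rows.append,   # long-term storage / lakehouse
    "metric": analytics_rows.append,    # real-time analytics store
}

def present(stream):
    """Route each processed event to the sink its kind calls for."""
    for event in stream:
        SINKS[event["kind"]](event)

present([
    {"kind": "alert", "payload": "fraud suspected"},
    {"kind": "metric", "payload": {"orders_per_min": 12}},
    {"kind": "archive", "payload": {"order_id": 1}},
])
```

The same processed stream is often written to several of these sinks at once, e.g. both to Kafka for immediate action and to the lakehouse for later analysis.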
Want to try out stream processing for yourself, or learn more?
I’d love to hear from you if you have further questions on how to get started with stream processing, Apache Flink, or Decodable.