Reverse ETL - a Backwards Step?
By now, everyone has seen the rETL (Reverse ETL) trend: you want to use data from app #1 (say, Salesforce) to enrich data in app #2 (Marketo, for example). Because most shops are already sending data from app #1 to the data warehouse with an ELT tool like Fivetran, many people took what they thought was a shortcut: doing the transformation in the data warehouse and then using an rETL tool to move the data out of the warehouse and into app #2.
The high-priced data warehouse, data lake, ELT, and rETL companies were happy to help users deploy what seemed like a pragmatic way to bring applications together, even at serious cost and complexity.
First we’ll discuss the (fatal) shortcomings of this approach, then offer easy, vendor-agnostic, sustainable patterns to get the job done, and finish with a few predictions. (Disclosure: I am the founder of a startup in this space.)
The most obvious drawback of this pattern is cost. Setting aside the expense of rETL tools, the cloud data warehouse is probably the most expensive CPU cycle available. On top of that, data warehouse vendors recommend patterns (like Databricks’ Medallion architecture) that duplicate the data repeatedly: first raw, then filtered/masked, then fine-tuned. Your Cadillac data warehouse becomes a junkyard, with broken-down cars and household appliances multiplying by the day. But isn’t this the over-centralization problem that soured everyone on the data lake? Weren’t paradigms like microservices and data mesh designed to avoid exactly this?
And how about data governance? Rather than just piping the data between apps, this architecture multiplies the data at various states of transformation. But what if you didn’t have to store the data at all to move and transform it?
Putting high-priced analytical database systems in the hot path introduces pants-on-head anti-patterns to supportability and ops. Who can say with a straight face that putting a data warehouse between two tier 1 apps is a good idea? Not many shops treat analytical systems - the data warehouse, ELT systems, offline feature stores, visualization systems - like tier 1, business-critical components. Companies don’t replicate analytics tools and data across availability zones for high availability, their teams don’t (usually) carry pagers for them, and they don’t duplicate processes; doing so would be an enormous cost and risk. What are we doing?
We’ve accidentally designed our customer experience to rely on slow batch ELT processes.
To make matters worse, rETL is frail and binary. To get data to your destination app, you’re relying on at least three different tools to complete their jobs in order and on time. If any part fails, zero data arrives. If a batch contains more data than expected, deadlines are missed, potentially leaving data unprocessed when scheduled downstream batch jobs run.
It’s worth taking a moment to acknowledge that the data warehouse is not an exchange or gateway. Its design center is to store data at scale, and to support things like large analytical queries and visualization tools. Gateways are built for many-to-many relationships, for decoupled no-knowledge apps, for decentralization and access control.
Lastly, let’s talk about timeliness. rETL jobs sometimes run as little as once a day: the data is out of date as soon as it arrives. To wave off this reality, one rETL company helpfully points out that over 80% of their jobs finish in under 30 seconds, indicating that users are running the tool so frequently there’s not much data to move each time. This gives away the game: despite what rETL vendors say, the business feels that up-to-date is better than out-of-date, and people are pushing rETL tools, batch processing, and the data warehouse to their limits.
Kappa: a Better Story
The advantages of rETL can be summed up in one word: convenience. Users already had ELT jobs moving data to the data warehouse. The data required to enrich app #1’s data is already there, too. And the skillsets are common: everyone knows how to use Fivetran and SQL. When you mention streaming as an alternative, the response is usually a variant of "that's too hard: we don't need real time."
But it’s not about real time, is it? It’s about the soaring Snowflake bill. It’s about the massive pile of tech debt that rETL is accumulating. It’s about unraveling the nest of critical dependencies on analytics tools and making support sustainable.
And “too hard” is changing: streaming is quickly reaching ease-of-use parity with rETL. Let me show you how.
Articulated presciently by Jay Kreps in 2014 and proven in recent years by many real-world firms, the Kappa architecture is a better way to move and process data. It costs far less, provides tier 1-quality SLAs without the cost of duplicating data, allows for fractional rather than total failures, and gets your data warehouse out of the critical path.
Put simply, Kappa means pulling data from app #1 as it's produced, sending it in bite-sized chunks to a data gateway, transforming/enriching it as necessary, and then delivering it to all the places it's needed in parallel.
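The steps above can be sketched in a few lines. This is a hedged illustration, not a real deployment: an in-memory queue stands in for the data gateway (Kafka, Kinesis, etc.), and all function and variable names here are hypothetical.

```python
# Minimal sketch of the Kappa flow: pull events as they occur, buffer them in
# a gateway, enrich in flight, and fan out to every destination. A real
# pipeline would use a streaming platform client instead of queue.Queue.
from queue import Queue

def pull_events():
    """Simulate change events arriving from app #1 (e.g. CDC output)."""
    yield {"op": "update", "account_id": "A1", "plan": "enterprise"}
    yield {"op": "update", "account_id": "A2", "plan": "free"}

def transform(event, reference):
    """Enrich each event in flight -- no warehouse storage required."""
    enriched = dict(event)
    enriched["owner"] = reference.get(event["account_id"], "unassigned")
    return enriched

def run_pipeline(sinks, reference):
    gateway = Queue()                      # stands in for the data gateway
    for event in pull_events():            # pull data as it's produced
        gateway.put(event)
    while not gateway.empty():
        enriched = transform(gateway.get(), reference)
        for sink in sinks:                 # deliver everywhere in parallel
            sink.append(enriched)

marketo, warehouse = [], []
run_pipeline([marketo, warehouse], reference={"A1": "alice"})
```

Note that the warehouse is just one sink among many here, not a required hop on the way to app #2.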
To pull the data, you can use native event queues like those found in Salesforce, native CDC output such as MongoDB's change streams, or CDC connectors. The apps and databases you want to source data from increasingly offer native CDC and event streams, eliminating the complexity of connectors.
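What a CDC event looks like varies by source, but many tools emit a change envelope with before/after row images. The sketch below follows Debezium's convention for the "op", "before", and "after" fields; treat it as illustrative rather than a universal format.

```python
import json

# A Debezium-style change envelope: op codes are c=create, u=update, d=delete.
# Other CDC sources use different shapes, so this handler is illustrative.
raw = json.dumps({
    "op": "u",
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
})

def to_change(envelope: str) -> dict:
    """Normalize a change envelope into a simple (kind, row) record."""
    event = json.loads(envelope)
    kind = {"c": "insert", "u": "update", "d": "delete"}[event["op"]]
    # For deletes only the "before" image is meaningful.
    row = event["after"] if kind != "delete" else event["before"]
    return {"kind": kind, "row": row}

change = to_change(raw)
```

A downstream transform only needs this normalized record, not knowledge of which database produced it.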
For a data gateway, we are now rich in low-ops cloud services with streaming platforms like Kafka, Kinesis, Pulsar, RedPanda, or even SQS.
Reformatting and enriching in flight has become as easy as batch with no-ops, SQL-based tools like Decodable and others. Select, where, insert into, done. Most modern tools can easily enrich data via joins and domain-specific functions.
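To make "select, where, insert into, done" concrete, here is that pattern run against SQLite purely for illustration. A streaming SQL engine applies the same kind of statement continuously to data in flight rather than once to stored tables; the table and column names are made up for this sketch.

```python
import sqlite3

# Stand-in tables for a raw event stream, a reference table used for
# enrichment, and an output stream. Names are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE raw_events (account_id TEXT, plan TEXT);
    CREATE TABLE accounts   (account_id TEXT, owner TEXT);
    CREATE TABLE enriched   (account_id TEXT, plan TEXT, owner TEXT);
    INSERT INTO raw_events VALUES ('A1', 'enterprise'), ('A2', 'free');
    INSERT INTO accounts   VALUES ('A1', 'alice');
""")
# Select, where, insert into -- with a join for enrichment.
db.execute("""
    INSERT INTO enriched
    SELECT e.account_id, e.plan, a.owner
    FROM raw_events e
    JOIN accounts a ON a.account_id = e.account_id
    WHERE e.plan = 'enterprise'
""")
rows = db.execute("SELECT * FROM enriched").fetchall()
```

The point is the shape of the statement: filter, join, and route in one declarative step, with no intermediate copies of the data.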
Delivery is getting easier every day. Databases like StarTree/Apache Pinot now ingest streaming data natively, as do apps like Salesforce. Many streaming services even handle delivery for you, writing directly to destinations like S3. Even the data warehouse vendors are starting to support native streaming ingest.
The Kappa architecture is natively distributed to protect against failure and enable low-cost replication across availability zones, regions, and clouds for resiliency. Tier 1, we have arrived.
For many data pipelines, we are now at the point of ease-of-use parity between streaming and batch/rETL.
Unlocking the Future
Which brings us to predictions. It seems that every week another app or database adds native support for streams as a source or sink, reaching toward the data gateway. At the same time, data gateways are working toward the apps and DBs with no-config, native connections.
There's a network effect here. As each app and DB comes online with no-config streaming capability, it unlocks pipelines with all the other components that have already done so. Given that a few rETL patterns represent most of the data moved (Salesforce, Segment, etc.), within the next two years we will reach a tipping point where rETL is the minority of new pipelines. Customers will free up tons of space in the data warehouse, retire batch-based tools, and reclaim large portions of their analytics budgets.
And as an added plus, every app will be real-time, all the time.
Get Started with Decodable
You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.