
5 ClickHouse Ingestion Scenarios That Need Decodable

Charles Harding
Decodable

ClickHouse is a popular database with nearly 25,000 GitHub stars. Styling itself as “a free analytics DBMS for big data,” it has achieved widespread adoption and helps engineers everywhere perform real-time analytics at scale. Benchmarked at up to 100x faster than Hive or MySQL, ClickHouse is used by many engineering teams to serve queries at very low latencies across large datasets. To achieve this performance at scale compared to standard data warehouses like Snowflake, it makes some architectural tradeoffs that users should keep in mind. For this reason, it pays to conform your data to ClickHouse best practices before ingestion. Decodable makes it simple to prep the data so ClickHouse performs at its best.

Here are 5 situations where you should pre-process your data before ingestion into ClickHouse:

1. Changing Data

When your data lives in another database and you need to get it into ClickHouse, some teams try to keep it fresh by running batch jobs every few minutes. This can lead to undesirable outcomes: heavy load on the source system, high compute costs, and unpredictable data consistency. A better option is to use Decodable’s easy CDC (change data capture) capability, which captures new data in real time, formats it in flight, and delivers it to ClickHouse in milliseconds.
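To make the shape of the work concrete, here is a minimal Python sketch of what that in-flight step does with a single change event. It assumes a Debezium-style envelope from the source connector, and the field names are made up for illustration; in practice a managed CDC connector performs this flattening for you.

```python
# Minimal sketch: flattening a Debezium-style CDC envelope into a flat row.
# The envelope shape and field names are assumptions for illustration.
change_event = {
    "op": "u",                        # c = insert, u = update, d = delete
    "ts_ms": 1700000000000,           # when the change was captured
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
}

def flatten(event: dict) -> dict:
    """Keep the latest image of the row plus change metadata."""
    row = dict(event["after"] or event["before"])
    row["_op"] = event["op"]
    row["_captured_at"] = event["ts_ms"]
    return row

print(flatten(change_event))
# {'id': 42, 'status': 'shipped', '_op': 'u', '_captured_at': 1700000000000}
```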

2. Denormalize or Normalize

You face a tradeoff here between performance and cost. While normalized schemas are more storage efficient (smaller) than denormalized ones, they also require joins at query time that can be expensive in both compute dollars and latency. If you need ultra-low query latency and want to minimize compute cost, you can combine tables into a single wide table, filter out unneeded columns, and ingest the data in that form. On the other hand, if you are less latency sensitive, want lower storage costs, and want to maintain normalization, you may prefer to split streaming events into multiple events and route them into multiple tables. Decodable supports both of these ingest strategies.
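As a rough illustration, here is the same event prepared both ways in Python; the order and customer fields are invented for the example and stand in for streams joined in a pipeline, not any specific Decodable configuration.

```python
# Sketch: one event prepared two ways before ingestion. Field names are
# illustrative assumptions.
order    = {"order_id": 1001, "customer_id": 7, "amount": 49.99}
customer = {"customer_id": 7, "name": "Ada", "country": "DE"}

# Option A, denormalized: join in flight into one wide row so ClickHouse
# needs no join (and pays no join latency) at query time.
wide_row = {
    **order,
    "customer_name": customer["name"],
    "customer_country": customer["country"],
}

# Option B, normalized: keep two narrow rows and route each to its own table,
# trading query-time joins for lower storage.
routed = {"orders": order, "customers": customer}

print(wide_row)
print(routed)
```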

3. Buffer and Deduplicate

Despite being a lightning-fast database, ClickHouse does not perform well when ingesting rows one at a time. Streaming events in individually is costly in both processing dollars and performance, which is why ClickHouse sets its default ingestion block size at 1 million rows. Users benefit from using stream processing to buffer streaming data and writing it to ClickHouse only when the optimal batch size is reached. It is also worth implementing exactly-once semantics in your streaming pipeline, which can offer cost, complexity, and latency advantages over pushing that deduplication work into ClickHouse. Decodable easily buffers rows and enforces exactly-once semantics by default.
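Here is a rough sketch of the buffering side in Python, writing batches through ClickHouse’s HTTP interface with JSONEachRow. The host, table name, batch size, and the `event_id` deduplication key are assumptions for illustration, and the naive in-memory dedup set is no substitute for real exactly-once guarantees; a managed sink handles the batching and bookkeeping for you.

```python
# Sketch: buffer rows and flush them to ClickHouse as one batched insert
# over the HTTP interface. Host, table, key, and batch size are assumptions.
import json
import requests

CLICKHOUSE_URL = "http://localhost:8123/"   # assumption: local server
TABLE = "events"                            # assumption: target table
BATCH_SIZE = 10_000                         # flush threshold

buffer: list[dict] = []
seen_keys: set[str] = set()                 # naive in-memory dedup, not exactly-once

def ingest(row: dict) -> None:
    """Buffer a row, dropping duplicates, and flush when the batch is full."""
    key = str(row["event_id"])
    if key in seen_keys:
        return
    seen_keys.add(key)
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush() -> None:
    """Send the whole buffer to ClickHouse in a single INSERT."""
    if not buffer:
        return
    body = "\n".join(json.dumps(r) for r in buffer)
    resp = requests.post(
        CLICKHOUSE_URL,
        params={"query": f"INSERT INTO {TABLE} FORMAT JSONEachRow"},
        data=body.encode("utf-8"),
    )
    resp.raise_for_status()
    buffer.clear()
```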

4. Data Formatting

ClickHouse especially shines when you optimize the data you feed it. The performance difference between a UInt32 and a Float64 can be 2x, which rewards users who convert data types in flight before ingesting. Reducing cardinality where possible, aggregating data to cut the number of data points, and grouping where allowed before ingestion will also drastically improve performance. Masking PII and correcting timestamp/datetime types are other frequently required formatting steps. In addition, inserting raw data is often inconvenient because very specific commands must be used to ensure data is spread across multiple nodes. Decodable can easily format data for ClickHouse.
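As an example, an in-flight formatting step might look like the following Python sketch. The field names, the hashing choice for masking PII, and the implied target column types are assumptions for illustration, not a prescribed schema.

```python
# Sketch: type and PII clean-up applied in flight before the row reaches
# ClickHouse. Field names and target types are illustrative assumptions.
import hashlib
from datetime import datetime, timezone

def format_row(raw: dict) -> dict:
    return {
        # cast numeric strings to integers so the column can be a UInt32
        "user_id": int(raw["user_id"]),
        # epoch milliseconds -> 'YYYY-MM-DD hh:mm:ss', parseable as DateTime
        "event_time": datetime.fromtimestamp(
            raw["ts_ms"] / 1000, tz=timezone.utc
        ).strftime("%Y-%m-%d %H:%M:%S"),
        # mask PII: keep only a stable hash of the e-mail address
        "email_hash": hashlib.sha256(raw["email"].encode()).hexdigest(),
        # reduce cardinality: bucket the raw user agent into a coarse class
        "client": "mobile" if "Mobile" in raw["user_agent"] else "desktop",
    }

print(format_row({
    "user_id": "1234", "ts_ms": 1700000000000,
    "email": "ada@example.com", "user_agent": "Mozilla/5.0 (Mobile; ...)",
}))
```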

5. Clean and Verify Data

Records often arrive with mixed message formats and inconsistent field names. ClickHouse users often rely on a dead letter queue to capture bad records for inspection and automated correction. Pre-processing with Decodable ensures messages are well formed and consistent before they are inserted into ClickHouse.
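A minimal sketch of such a validation step in Python is below; the required fields are made up, and the two in-memory lists stand in for a real output stream and dead letter queue.

```python
# Sketch: route malformed records to a dead letter queue instead of letting
# them break the insert. Required fields and outputs are assumptions.
REQUIRED = {"event_id": int, "event_time": str, "payload": dict}

clean_records: list[dict] = []       # stands in for the ClickHouse-bound stream
dead_letter_queue: list[dict] = []   # stands in for the DLQ topic

def validate(record: dict) -> None:
    """Send well-formed records onward; quarantine the rest with reasons."""
    errors = [
        f"missing or wrong type: {field}"
        for field, expected in REQUIRED.items()
        if not isinstance(record.get(field), expected)
    ]
    if errors:
        dead_letter_queue.append({"record": record, "errors": errors})
    else:
        clean_records.append(record)

validate({"event_id": 1, "event_time": "2024-01-01 00:00:00", "payload": {}})
validate({"event_id": "oops"})       # lands in the dead letter queue
print(len(clean_records), len(dead_letter_queue))   # 1 1
```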

If you are a ClickHouse user and you have a streaming platform (Kafka, Kinesis, Pulsar, Redpanda, etc.), you can log into Decodable.co and easily connect and transform your data today. If you need help, set up a free session with one of our experts here and we’ll assist in creating your pipeline.


You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.

Learn more:

Join the community Slack


