ClickHouse is a popular database with nearly 25,000 GitHub stars and stylizing itself as “a free analytics DBMS for big data,” ClickHouse has achieved widespread adoption and helped engineers everywhere perform real type analytics at scale. Benchmarked at 100x faster than Hive or MySQL, ClickHouse is adopted by many engineering teams to serve queries at very low latencies across large datasets. To achieve this performance at scale compared to standard data warehouses like Snowflake, it makes some architectural tradeoffs that users should keep in mind. For this reason, users should consider conforming the data to ClickHouse best practices before ingestion. Decodable makes it simple to prep the data so ClickHouse performs at its best.
Here are 5 situations where you should pre-process your data before ingestion into ClickHouse:
1. Changing Data
When your data is in another database and you need to get it into ClickHouse, some users try to keep data fresh by running frequent batch jobs that run every couple minutes. This can lead to undesirable outcomes like heavy load on the source system, high compute costs, and unpredictable data consistency. A better option is to use Decodable’s easy CDC (change data capture) capability, capturing new data in real-time, formatting it in real-time, and delivering to ClickHouse in milliseconds.
2. Denormalize or Normalize
You face a tradeoff here on performance and cost. While normalized schemas are always more storage efficient (smaller) compared to denormalized ones, they also require joins at query time that can be expensive in both compute dollars and latency. If you need ultra low query latency and want to minimize compute cost, tables can be combined into a single table, unneeded columns can be filtered out, and ingested in this format. On the other hand, if you are not as latency sensitive, want lower storage costs, and want to maintain normalization, you may want to split streaming events into multiple events and route them into multiple tables. Decodable supports both of these ingest strategies.
3. Buffer and DeDuplicate
Despite being a lightning fast database, ClickHouse does not perform well when ingesting rows one at a time (more here). Streaming events one at a time can be costly in terms of processing dollars and performance, so ClickHouse sets their default ingestion size at 1 million rows. Users will benefit from using stream processing to buffer streaming data, dumping it to ClickHouse when the optimal batch size is reached. It’s also important to consider implementing exactly-once in your streaming pipeline, which may provide cost, complexity, and latency advantages over pushing this deduplication processing into ClickHouse. Decodable easily buffers rows and enforces exactly-once by default.
4. Data Formatting
ClickHouse can especially shine if you optimize the data. The performance difference between an UINT32 and FLOAT64 can be 2x, which promises rewards to users who convert data types in flight before ingesting. Reducing cardinality where possible, aggregating data to reduce the number of data points, and grouping where allowed before ingestion will also drastically improve performance. Masking PII and correcting timestamp/datetime types are another frequently required formatting step. In addition, inserting raw data is often not convenient because very specific commands must be used to ensure data is spread across multiple nodes. More examples of recommended data formatting are here. Decodable can easily format for ClickHouse
5. Clean and Verify Data
Records often contain mixed message formats and inconsistent field names. ClickHouse users often utilize a dead letter queue to capture bad records for inspection and automate correction. Pre-processing with Decodbale will guarantee messages are well-formed and consistent before insertion into ClickHouse.
If you are a ClickHouse user today and you have a streaming platform (Kafka, Kinesis, Pulsar, RedPanda, etc), you can log into Decodable.co and easily connect and transform your data today. If you need help, set up a free session with one of our experts here and we’ll assist in creating your pipeline.
You can get started with Decodable for free - our developer account includes enough for you to build a useful pipeline and - unlike a trial - it never expires.
Join the community Slack