Bulk inserts on Clickhouse: How to avoid overstuffing your instance
Feb 10, 2025

As we hit the midway point of the second month in 2025, a lot of you might be starting to really dig in on new data initiatives and planning key infrastructure changes to your company’s data stack. Among these changes, a common theme I’ve noticed recently in posts on the Clickhouse Slack workspace’s #help channel is the fast adoption of Clickhouse Cloud, with people attempting to port large workloads from self-hosted Clickhouse over to a managed environment.

The nice thing is that Clickhouse absolutely loves bulk data, but there are still some things to consider when onboarding your data onto Clickhouse Cloud.

Understanding MergeTree

At its core, MergeTree is Clickhouse’s primary table engine that handles data storage and retrieval. It’s designed to efficiently manage large volumes of data by organizing it into parts that can be merged in the background. This architecture is particularly important to understand when performing bulk inserts, as it directly impacts how your data is stored and optimized.

When performing bulk inserts into Clickhouse, it’s crucial to be aware of how MergeTree handles data merging. Each bulk insert creates a new “part” in the table, and these parts are periodically merged in the background to optimize storage and query performance. However, if too many small inserts are made in quick succession, it can lead to an excessive number of parts, which might overwhelm the background merge process, degrade performance, and eventually trigger the infamous “too many parts” error.

To avoid this, aim to batch your data into larger chunks before inserting. Clickhouse is highly efficient at handling large datasets, so fewer, larger inserts are generally better than numerous small ones. Additionally, consider using INSERT INTO with the FORMAT clause and a bulk-friendly input format such as JSONEachRow, CSV, or Parquet.
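
To make that concrete, here’s a minimal sketch of a single large batched insert using the clickhouse-connect Python client. The connection details and the events table are placeholders for illustration, not something prescribed by the Clickhouse docs:

```python
import clickhouse_connect

# Placeholder connection details and table; adjust for your own instance.
client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Build one large batch in memory instead of issuing thousands of single-row inserts.
rows = [(i, f"user_{i % 100}", "page_view") for i in range(100_000)]

# One insert call creates one new part, rather than 100,000 tiny parts for MergeTree to merge.
# Roughly equivalent to: INSERT INTO events (id, user, event_type) FORMAT JSONEachRow ...
client.insert("events", rows, column_names=["id", "user", "event_type"])
```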

Monitoring the system tables, such as system.parts and system.merges, can also give you insights into the health of your MergeTree tables and help you adjust your strategy for bulk inserts as needed.
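
If you want to keep an eye on those tables programmatically, a small sketch like the one below (again assuming the clickhouse-connect client and a hypothetical events table) reports active part counts per partition and any merges currently running:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Active part counts per partition for a hypothetical `events` table.
parts = client.query(
    "SELECT partition, count() AS part_count "
    "FROM system.parts "
    "WHERE active AND table = 'events' "
    "GROUP BY partition "
    "ORDER BY part_count DESC"
)
for partition, part_count in parts.result_rows:
    print(f"partition={partition} active_parts={part_count}")

# Merges currently in flight.
merges = client.query("SELECT table, elapsed, progress FROM system.merges")
for table, elapsed, progress in merges.result_rows:
    print(f"table={table} elapsed={elapsed:.1f}s progress={progress:.0%}")
```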

Source: https://clickhouse.com/docs/en/optimize/bulk-inserts

How to bulk insert effectively

When it comes to bulk inserting data into Clickhouse, there are a few best practices that can make the process smoother and more efficient. Understanding how Clickhouse processes data and how to optimize your inserts can save you time and resources while ensuring optimal performance.

Bigger is better but…

Ideally, you should send your data in fewer, larger inserts rather than a bunch of small ones, so that the parts you create can be merged efficiently in the background.

However, there are some practical limits to consider. While Clickhouse can handle very large inserts, you’ll want to find a sweet spot that balances your network bandwidth, memory usage, and processing capabilities. A good starting point is to aim for batch sizes between 50,000 to 1,000,000 rows, depending on your row size and available resources.
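
As an illustration of that sweet spot, the sketch below splits an in-memory dataset into fixed-size batches. The 100,000-row batch size and the events table are assumptions you’d tune for your own row width and hardware:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Assumed batch size in the suggested 50k-1M range; tune for your row width and memory.
BATCH_SIZE = 100_000

def insert_in_batches(rows, table, column_names):
    """Split a large in-memory dataset into fixed-size batches and insert each one."""
    for start in range(0, len(rows), BATCH_SIZE):
        client.insert(table, rows[start:start + BATCH_SIZE], column_names=column_names)

rows = [(i, f"user_{i % 100}") for i in range(1_000_000)]
insert_in_batches(rows, "events", ["id", "user"])
```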

Let everything digest a bit 😋

Another important caveat is that, much like any good feast, you’ll want to give the background merging process time to digest the parts that are being created, so the recommended pacing of inserts is no more than one per second.

This pacing allows the background merges to complete efficiently and prevents resource contention. If you’re dealing with particularly large datasets, you might even want to introduce longer delays between batches to ensure smooth processing. A workaround that allows for higher insert rates is to use asynchronous inserts via the async_insert setting.
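
Here’s a rough sketch of both approaches, assuming the clickhouse-connect client and an illustrative events table: pacing synchronous batches at roughly one per second, then opening a second client with the async_insert and wait_for_async_insert settings so the server buffers small writes itself.

```python
import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Illustrative batches; in practice these would come from your pipeline.
batches = [
    [(i, f"user_{i % 100}") for i in range(start, start + 50_000)]
    for start in range(0, 150_000, 50_000)
]

# Option 1: pace synchronous inserts to roughly one per second.
for batch in batches:
    client.insert("events", batch, column_names=["id", "user"])
    time.sleep(1)

# Option 2: let the server buffer and flush writes itself with async inserts.
async_client = clickhouse_connect.get_client(
    host="localhost",
    username="default",
    password="",
    settings={"async_insert": 1, "wait_for_async_insert": 1},
)
async_client.insert("events", [(999_999, "user_42")], column_names=["id", "user"])
```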

5 Tools for Clickhouse bulk data ingestion

Now that we know the high-level concepts behind bulk data ingestion in Clickhouse, let’s peek at some of the tools that enable it (and some replication tools).

1. Jitsu Bulker

Jitsu Bulker is a powerful tool designed for efficient data ingestion into Clickhouse. It enables users to streamline the process of bulk inserting large datasets by batching data into optimal sizes, reducing the risk of creating too many parts. Bulker also supports various data formats and integrates seamlessly with existing data pipelines, making it a versatile choice for teams looking to accelerate their data ingestion workflows.

2. Clickhouse Bulk

clickhouse-bulk is a popular open-source tool that provides an HTTP interface for buffering and batching data before sending it to Clickhouse. It acts as a proxy that aggregates smaller inserts into larger batches, helping to prevent the “too many parts” problem while maintaining high throughput. The tool also includes features like automatic retries and basic monitoring capabilities.
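
As a hedged sketch of what that looks like in practice, the snippet below posts a TabSeparated payload to a locally running clickhouse-bulk proxy. The port (8124) and the events table are assumptions based on the tool’s defaults, so check your own configuration:

```python
import requests

# clickhouse-bulk speaks the same HTTP insert interface as Clickhouse itself;
# 8124 is assumed to be the proxy's default listen port -- verify against your config.
payload = "\n".join(f"{i}\tuser_{i % 100}" for i in range(1_000))

resp = requests.post(
    "http://localhost:8124/",
    params={"query": "INSERT INTO events (id, user) FORMAT TabSeparated"},
    data=payload,
    timeout=10,
)
resp.raise_for_status()
```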

3. PeerDB (Acquired by Clickhouse)

PeerDB, recently acquired by Clickhouse, is a powerful tool for real-time data replication and migration between databases. It specializes in Change Data Capture (CDC) and enables efficient bulk data transfers from popular databases like PostgreSQL to Clickhouse. The acquisition signals Clickhouse’s commitment to making data migration and replication more seamless for enterprise users.

4. DLT

DLT (Data Load Tool) is a Python-based ETL framework that excels at bulk data loading into Clickhouse. It provides a simple yet powerful interface for handling data transformations and insertions, with built-in support for optimal batch sizes and retry mechanisms. What sets DLT apart is its focus on developer experience, offering clear debugging capabilities and extensive documentation for common data loading patterns.
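
To give a flavour of that developer experience, here’s a minimal sketch of a dlt pipeline targeting the Clickhouse destination. The resource, pipeline, and dataset names are made up for illustration, and it assumes the clickhouse extra is installed and credentials are configured via dlt’s secrets:

```python
import dlt

# Illustrative resource: yields dicts that dlt will batch and load for us.
@dlt.resource(name="events")
def events():
    for i in range(100_000):
        yield {"id": i, "user": f"user_{i % 100}", "event_type": "page_view"}

# Pipeline, destination, and dataset names are made up for this sketch;
# assumes `pip install "dlt[clickhouse]"` and credentials in dlt's secrets.toml.
pipeline = dlt.pipeline(
    pipeline_name="bulk_load_demo",
    destination="clickhouse",
    dataset_name="analytics",
)

load_info = pipeline.run(events())
print(load_info)
```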

5. Dispatch (Launching soon)

Dispatch is an upcoming Arrow-based serialized ETL tool designed to simplify bulk data ingestion into Clickhouse. By leveraging the Apache Arrow format, Dispatch ensures high-speed data serialization and deserialization, minimizing overhead during data transfers. With its focus on efficient, large-scale data handling, Dispatch can provide seamless integration into existing pipelines while maintaining optimal batch sizes and reducing resource contention. Stay tuned for its release to see how it can transform your data ingestion workflows.

Conclusion

Bulk data ingestion in Clickhouse doesn’t have to be complicated, but it does require understanding some key concepts and choosing the right tools for your specific use case. Whether you’re migrating from a self-hosted instance to Clickhouse Cloud or starting fresh with a new data pipeline, following the best practices outlined above and leveraging appropriate ingestion tools will help ensure smooth, efficient data loading. Remember: batch wisely, pace your inserts, and let the background processes do their work. If you’re looking for a more hands-on approach to bulk data ingestion, connect with us and we can help you start moving your data onto Clickhouse Cloud as soon as possible.