I'm inserting a lot of CSV data files into remote Clickhouse database that already has a lot of data. I'm doing it using simple script like this:
...
for j in *.csv; do
clickhouse-client --max_insert_threads=32 --receive_timeout=30000 --input_format_allow_errors_num=999999999 --host "..." --database "..." --port 9000 --user "..." --password "..." --query "INSERT INTO ... FORMAT CSV" < "$j"
done
...
So my question is: how to optimize these inserts? I already used these options for optimization:
--max_insert_threads=32 --receive_timeout=30000
Are there any more options in clickhouse-client
I should use for better performance and for what purpose? One file can be like 300-500mb (and sometimes more). According to this article using parallel processes won't help that's why I'm inserting one file at time.
max_insert_threads is not applicable here, it's about insert select
inside CH server.
According to this article using parallel processes won't help
It should help (it depends on CPU and disk power), just try
# parallelism 6 (-P6)
find . -type f -name '*.csv' | xargs -P 6 -n 1 clickhouse-client --input_format_parallel_parsing=0 --receive_timeout=30000 --input_format_allow_errors_num=999999999 --host "..." --database "..." --port 9000 --user "..." --password "..." --query "INSERT INTO ... FORMAT CSV"
I set input_format_parallel_parsing=0
deliberately, it improves total performance in case of multiple parallel loads.