We've recently started using ClickHouse and wrote a query that filters entries on a column named sha_data (a 256-bit hash). The table's primary index is built on 4 other columns and does not include sha_data. We're doing performance testing on a table containing ~1 billion rows.

The query originally contained an inner join, which we replaced with an IN clause because the join was causing two full scans. We tried a plain bloom filter index (bloom_filter(0.01) GRANULARITY 8) as well as set and tokenbf_v1 indexes with different parameters, but the query still fails with memory-limit-exceeded errors and takes more than 50 minutes to execute. We currently have 32 GB of memory configured, and it is still insufficient.

Can anyone suggest areas we could look into to optimize performance and avoid the memory overuse error? We're also wondering why the bloom filter index is not reducing the number of rows processed and the query still does a full scan. Thanks.
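For reference, a minimal sketch of the rewrite described above (table and column names other than sha_data are illustrative placeholders, not our actual schema):

-- original form: the inner join caused full scans of both tables
SELECT t.*
FROM big_table AS t
INNER JOIN sha_lookup AS s ON t.sha_data = s.sha_data;

-- rewritten form: a single pass over big_table, filtered with an IN clause
SELECT *
FROM big_table
WHERE sha_data IN (SELECT sha_data FROM sha_lookup);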
Reducing the granularity from 8 to 4, along with lowering the false positive rate from 0.01 to 0.001, was effective in skipping unnecessary granule reads. Setting the false positive rate to 0.0001 filtered even more accurately, but at the cost of a larger bloom filter and higher memory usage.
ALTER TABLE testtable ADD INDEX testindex (testColumn) TYPE bloom_filter(0.001) GRANULARITY 4;
ALTER TABLE testtable MATERIALIZE INDEX testindex;
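To verify that the index is actually skipping granules, EXPLAIN with indexes = 1 can be used (testtable/testindex as above; the literal hash value is a placeholder):

EXPLAIN indexes = 1
SELECT count()
FROM testtable
WHERE testColumn = '<some-hash-value>';

In the output, the Skip entry for testindex reports granules selected out of the total; a sharp drop there confirms the bloom filter is pruning reads rather than falling back to a full scan.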
Thanks.