clickhouse

How to check not seen recently data via materialized view in clickhouse


I got this tables

I need to store only ids that are not in seen_ids:

    CREATE materialized view mv_fresh
    TO target_ids
    AS SELECT
      *
    FROM source_ids where id not in (select id from seen_ids)

Also Id adds to seen_ids, so then it comes next time it is not added to target_ids.

Other option is to use LEFT JOIN, may be it is also not good idea for large table.

This is example of what I need to do, what ways do you recommend to check on "not seen recently" ids? This example is probably bad design for large seen_ids table (50 mln rows) and large income data.


Solution

  • Some ways to improve perf:

    1. try to optimize IN operator
    CREATE materialized view mv_fresh
        TO target_ids
        AS SELECT
         ..
        FROM source_ids
        where id not in (
            select id from seen_ids 
            where id IN (select id FROM source_ids))
    
    1. consider to reduce index_granularity for db.seen_ids or use skipping indexes

    See Skipping Indexes: set.

    CREATE TABLE db.seen_ids_local ON CLUSTER cl
    (..)
    ORDER BY id
    ENGINE = MergeTree()
    SETTINGS index_granularity = 256 or 512 or ..;
    
    1. try to find some data insights that help to reduce the amount of processed data

    For example, if id is incremental then can be applied some condition WHERE id > {some_value_that_calculated_by_MV}

    1. use SET engine

    Make sense to consider using a table with SET engine to store seen IDs and use it in IN operator.

    1. bulk insert (thousands of rows or more)

    It needs to

    1. use distributed calculation (sharding) (split IDs to groups that be proceeded independently)

    For example, for 2 shards on each one be proceeded 50 / 2 = 25 mln rows of seen_ids table.

    CREATE TABLE db.seen_ids_local ON CLUSTER cl
    (..)
    ORDER BY id
    ENGINE = MergeTree();
    
    /* ! Write data to distributed table to reshard them among shards. Or write data directly to shard based on sharding key. */
    CREATE TABLE db.seen_ids ON CLUSTER cl AS db.seen_ids_local
    ENGINE = Distributed(cl, db, seen_ids_local, intHash32(id)); /* Reshard data by 'id'. */
    
    
    CREATE TABLE db.source_ids_local ON CLUSTER cl
    (..)
    ENGINE = Null;
    
    /* ! Write data to distributed table to reshard them among shards. */
    CREATE TABLE db.source_ids ON CLUSTER cl AS db.source_ids_local
    ENGINE = Distributed(cl, db, source_ids_local, intHash32(id)); /* Reshard data by 'id'. The same sharding key as for 'seen_ids' */
    
    
    CREATE TABLE db.target_ids_local ON CLUSTER cl
    (..)
    ENGINE = MergeTree();
    
    CREATE materialized view db.target_ids_local_mv
    TO db.target_ids_local  /* [Optional] Write data to local table */
    AS SELECT
    ..  
    FROM db.source_ids_local /* !! Load data from local table */
    where id not in (
      select id from db.seen_ids_local /* !! Load data from local table */
    )
    

    Look at these docs: Altinity Knowledge Base.