How to check not seen recently data via materialized view in clickhouse

I got this tables

seen_ids table with ids.
source_ids null engine table where data comes
target_ids with fresh ids that was not in seen_ids on the moment when it comes.

I need to store only ids that are not in seen_ids:

    CREATE materialized view mv_fresh
    TO target_ids
    AS SELECT
      *
    FROM source_ids where id not in (select id from seen_ids)

Also Id adds to seen_ids, so then it comes next time it is not added to target_ids.

Other option is to use LEFT JOIN, may be it is also not good idea for large table.

This is example of what I need to do, what ways do you recommend to check on "not seen recently" ids? This example is probably bad design for large seen_ids table (50 mln rows) and large income data.

Solution

Some ways to improve perf:

try to optimize IN operator

CREATE materialized view mv_fresh
    TO target_ids
    AS SELECT
     ..
    FROM source_ids
    where id not in (
        select id from seen_ids 
        where id IN (select id FROM source_ids))

consider to reduce index_granularity for db.seen_ids or use skipping indexes

See Skipping Indexes: set.

CREATE TABLE db.seen_ids_local ON CLUSTER cl
(..)
ORDER BY id
ENGINE = MergeTree()
SETTINGS index_granularity = 256 or 512 or ..;

try to find some data insights that help to reduce the amount of processed data

For example, if id is incremental then can be applied some condition WHERE id > {some_value_that_calculated_by_MV}

use SET engine

Make sense to consider using a table with SET engine to store seen IDs and use it in IN operator.

bulk insert (thousands of rows or more)

It needs to

either change the logic of the service that is writing data to db.source_ids
or use Buffer engine (see related questions on SO)

use distributed calculation (sharding) (split IDs to groups that be proceeded independently)

For example, for 2 shards on each one be proceeded 50 / 2 = 25 mln rows of seen_ids table.

CREATE TABLE db.seen_ids_local ON CLUSTER cl
(..)
ORDER BY id
ENGINE = MergeTree();

/* ! Write data to distributed table to reshard them among shards. Or write data directly to shard based on sharding key. */
CREATE TABLE db.seen_ids ON CLUSTER cl AS db.seen_ids_local
ENGINE = Distributed(cl, db, seen_ids_local, intHash32(id)); /* Reshard data by 'id'. */


CREATE TABLE db.source_ids_local ON CLUSTER cl
(..)
ENGINE = Null;

/* ! Write data to distributed table to reshard them among shards. */
CREATE TABLE db.source_ids ON CLUSTER cl AS db.source_ids_local
ENGINE = Distributed(cl, db, source_ids_local, intHash32(id)); /* Reshard data by 'id'. The same sharding key as for 'seen_ids' */


CREATE TABLE db.target_ids_local ON CLUSTER cl
(..)
ENGINE = MergeTree();

CREATE materialized view db.target_ids_local_mv
TO db.target_ids_local  /* [Optional] Write data to local table */
AS SELECT
..  
FROM db.source_ids_local /* !! Load data from local table */
where id not in (
  select id from db.seen_ids_local /* !! Load data from local table */
)

Look at these docs: Altinity Knowledge Base.