clickhouse

Alternatives to multiSearchAnyCaseInsensitive for huge needles array in clickhouse


I got array of words roughly 10k in size. I need to check text column of rows if there is any of this words in the text. I decided to use multiSearchAnyCaseInsensitive:

AS WITH (select groupArray(word) from bad_words) AS patterns
SELECT
  s.id AS id,
  multiSearchAnyCaseInsensitive(s.text, patterns) AS has_bad_words 
FROM source s
GROUP BY id;

but it has limitation on number of elements in array, so I got this error

Number of arguments for function multiSearchAnyCaseInsensitive doesn't match: passed 10002, should be at most 255

What other options do I have?

P.S. I can't split text by space, because some needles are couple of words.


Solution

  • try following approach

    CREATE TABLE source
    (
        id   UInt32,
        text String
    )
    ENGINE = MergeTree
    ORDER BY id;
    
    CREATE TABLE bad_words
    (
        word String
    )
    ENGINE = MergeTree
    ORDER BY word;
    
    
    SELECT
        s.id AS id,
        count() > 0 AS has_bad_words
    FROM source s
    JOIN bad_words bw
        ON positionCaseInsensitive(s.text, bw.word) > 0
    GROUP BY s.id