Tags: sql, mongodb, performance, clickhouse, clickhouse-go

ClickHouse - SELECT row of data is too slow


The following problem occurred in our project, and we cannot solve it. We have a huge amount of log data, and we are moving from MongoDB to ClickHouse.

Our table is created like this:

CREATE TABLE IF NOT EXISTS logs ON CLUSTER default (
    raw         String,
    ts          DateTime64(6) MATERIALIZED toDateTime64(JSONExtractString(raw, 'date_time'), 6),
    device_id   String        MATERIALIZED JSONExtractString(raw, 'device_id'),
    level       Int8          MATERIALIZED JSONExtractInt(raw, 'level'),
    context     String        MATERIALIZED JSONExtractString(raw, 'context'),
    event       String        MATERIALIZED JSONExtractString(raw, 'event'),
    event_code  String        MATERIALIZED JSONExtractInt(raw, 'event_code'),
    data        String        MATERIALIZED JSONExtractRaw(raw, 'data'),
    date        Date          DEFAULT toDate(ts),
    week        Date          DEFAULT toMonday(ts)
)
ENGINE ReplicatedReplacingMergeTree()
ORDER BY (device_id, ts)
PARTITION BY week

And I'm running a query like this:

SELECT device_id,toDateTime(ts),context,level,event,data 
FROM logs 
WHERE device_id = 'some_uuid'
ORDER BY ts DESC 
LIMIT 10 
OFFSET 0;

This is the result: 10 rows in set. Elapsed: 6.23 sec.

And the second query, without ORDER BY, LIMIT and OFFSET:

SELECT device_id,toDateTime(ts),context,level,event,data 
FROM logs 
WHERE device_id = 'some_uuid'

This is the result: Elapsed: 7.994 sec. for every 500 rows out of 130,000+.

This is too slow.

It seems that ClickHouse processes all the rows in the table. What is wrong, and what do we need to change to improve ClickHouse's speed?

The same implementation on MongoDB takes 200-500 ms at most.


Solution

  • Egor! When you mentioned "we go to ClickHouse from MongoDB", did you mean you switched from MongoDB to ClickHouse to store your data? Or do you somehow connect to ClickHouse from MongoDB to run the queries you're referring to?

    I'm not sure how you ingest your data, but let's focus on the reading part.

    For the MergeTree family, ClickHouse writes data in parts. Therefore, it is vital to have a timestamp as part of your WHERE clause, so ClickHouse can determine which parts it needs to read and skip most of the data you don't need. Otherwise, it will scan all the data.

    I would imagine these queries will do the scan faster:

    SELECT device_id,toDateTime(ts),context,level,event,data 
    FROM logs 
    WHERE device_id = 'some_uuid' AND week = '2021-07-05'
    ORDER BY ts DESC 
    LIMIT 10 
    OFFSET 0;
    
    SELECT device_id,toDateTime(ts),context,level,event,data 
    FROM logs 
    WHERE device_id = 'some_uuid' AND week = '2021-07-05';
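
    As a side note (my addition, not part of the original answer): on recent ClickHouse versions you can check whether the partition key and the primary key are actually being used by prefixing a query with EXPLAIN indexes = 1; the output lists how many partitions, parts and granules each index stage selects, so you can see the pruning directly:

    EXPLAIN indexes = 1
    SELECT device_id,toDateTime(ts),context,level,event,data
    FROM logs
    WHERE device_id = 'some_uuid' AND week = '2021-07-05'
    ORDER BY ts DESC
    LIMIT 10;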
    

    AFAIK, unless you specified the exact partition format, ClickHouse will use partitioning by month (i.e. toYYYYMM()) for your CREATE TABLE statement. You can check that by looking at the system.parts table:

    SELECT
        partition,
        name,
        active
    FROM system.parts
    WHERE table = 'logs'
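
    To see the partition layout at a glance (again my addition, not from the original answer), you can also aggregate system.parts per partition; the partition names and row counts make it obvious whether the data ended up in monthly or weekly chunks and how big each one is:

    SELECT
        partition,
        count() AS part_count,
        sum(rows) AS row_count,
        formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
    FROM system.parts
    WHERE table = 'logs' AND active
    GROUP BY partition
    ORDER BY partition;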
    

    So, if you want to store data in weekly parts, I would imagine the partitioning could look like this:

    ...
    ORDER BY (device_id, ts)
    PARTITION BY toMonday(week)
    

    This is also a good piece of information: Using Partitions and Primary keys in queries
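
    To illustrate that last point with the table above (my own sketch, not taken from the linked article): combining the partition key (week) with the leading primary-key column (device_id) and a ts range lets ClickHouse prune whole partitions first and then skip index granules inside the ones that remain:

    SELECT device_id,toDateTime(ts),context,level,event,data
    FROM logs
    WHERE device_id = 'some_uuid'
      AND week = '2021-07-05'
      AND ts >= '2021-07-05 00:00:00'
      AND ts < '2021-07-12 00:00:00'
    ORDER BY ts DESC
    LIMIT 10;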