sqlclickhouseclickhouse-client

How to reduce the memory usage of a multiple ASOF JOIN queries?


I have one query that results in ~100 rows. Not really much. This table is called FillsTable.

Then I have mulitple tables SignalTable1,2,3,4,5,6 that have millions of rows.

I want to ASOF JOIN the first table with all Signal tables to get the values of signals "as-of" the time of fills.

I have written the following:

SELECT *, Signal1 FROM (
SELECT *, Signal2 FROM (
SELECT *, Signal3 FROM (
SELECT *, Signal4 FROM (
SELECT *, Signal5 FROM (
SELECT *, Signal6 FROM (
   SELECT * FROM TableFills
   WHERE timestamp between 'start_date' and 'end_date'
) AS F ASOF JOIN SignalTable6 AS S USING (name, timestamp) WHERE S.timestamp between 'start_date' and 'end_date'
) AS F ASOF JOIN SignalTable5 AS S USING (name, timestamp) WHERE S.timestamp between 'start_date' and 'end_date'
) AS F ASOF JOIN SignalTable4 AS S USING (name, timestamp) WHERE S.timestamp between 'start_date' and 'end_date'
) AS F ASOF JOIN SignalTable3 AS S USING (name, timestamp) WHERE S.timestamp between 'start_date' and 'end_date'
) AS F ASOF JOIN SignalTable2 AS S USING (name, timestamp) WHERE S.timestamp between 'start_date' and 'end_date'
) AS F ASOF JOIN SignalTable1 AS S USING (name, timestamp) WHERE S.timestamp between 'start_date' and 'end_date'

I would expect the memory usage of the query to be pretty small - the "inside" query has only 100 rows, and it is joined with one signal table at a time. However, it looks like clickhouse computes hash tables for right table for each of the signal tables, and I am hitting our 6GB memory limit pretty fast.

I tried also the following pattern:

SELECT T1.name, T1.timestamp, T1.signal1, T2.signal2, ...etc...
FROM (
   SELECT * FROM TableFills WHERE timestamp between 'start_date' and 'end_date'
   ASOF JOIN SignalTable1 AS S WHERE S.timestamp between 'start_date' and 'end_date'
) AS T1
INNER JOIN (
   SELECT * FROM TableFills WHERE timestamp between 'start_date' and 'end_date'
   ASOF JOIN SignalTable1 AS S WHERE S.timestamp between 'start_date' and 'end_date'
) AS T2
ON T1.timestamp = T2.timestamp
INNER JOIN (
   SELECT * FROM TableFills WHERE timestamp between 'start_date' and 'end_date'
   ASOF JOIN SignalTable1 AS S WHERE S.timestamp between 'start_date' and 'end_date'
) AS T3
ON T2.timestamp = T3.timestamp
...etc...

However it also resulted in high memory usage.

The only solution looks to me is to execute separate 6 queries and join them on the application side. Is there a better way to ASOF JOIN with multiple tables joining with one table at a time?


Solution

  • It looks like clickhouse does not do the optimization where it can apply WHERE to the right part of ASOF JOIN. Put WHERE parts inside subqueries to be ASOF JOINed. That way you'll reduce the hash table size.

    ...etc...
    SELECT *, Signal6 FROM (
       SELECT * FROM TableFills
       WHERE timestamp between 'start_date' and 'end_date'
    ) AS F ASOF JOIN (
        SELECT * FROM SignalTable6 AS S
        WHERE S.timestamp between 'start_date' and 'end_date'
    ) AS S USING (name, timestamp)
    ...etc...