sqlperformanceclickhousecumulative-sum

Cumulative Sum in ClickHouse


I have clickhouse table with 3 columns: respondent, spot и weight.

Data looks like this:

resp, spot, weight
1, 10, 100
2, 10, 200
3, 10, 300
1, 10, 100
1, 20, 100
3, 20, 300
4, 20, 400
1, 30, 100
2, 30, 200
3, 30, 300
5, 30, 500

spot column is sorted. I need to write a query to calculate the accumulated Weight for Spots in the order of 10, 20, 30 so that each respondent is counted only once. That is, in spot = 20 respondents that were in spot = 10 should be taken into cumulative sum, but only once. In spot = 30 respondents that were in spot = 10 and spot = 20 should be taken into cumulative sum, but only once.

The final result will be as follows:

spot, summed_weight
10, 600
20, 1000
30, 1500

I tried to use group_array function to combine respondent and its weight, and found last rows per spot via row_number function, but RAM consumption had become extremely high. Something like this:

resp, spot, weight, group_array, row_number
1, 10, 100, [(1,100)], 4
2, 10, 200, [(1,100), (2,200)], 3
3, 10, 300, [1,100), (2,200), (3,300)], 2
1, 10, 100, [1,100), (2,200), (3,300)], 1
1, 20, 100, [1,100), (2,200), (3,300)], 3
3, 20, 300, [1,100), (2,200), (3,300)], 2
4, 20, 400, [1,100), (2,200), (3,300), (4,400)], 1
1, 30, 100, [1,100), (2,200), (3,300), (4,400)], 4
2, 30, 200, [1,100), (2,200), (3,300), (4,400)], 3
3, 30, 300, [1,100), (2,200), (3,300), (4,400)], 2
5, 30, 500, [1,100), (2,200), (3,300), (4,400), (5,500)], 1

Solution

  • Thanks to @mark for making me realise what you meant. I think the following should be alittle more efficient

    CREATE TABLE test
    (
        `resp` Int64,
        `spot` Int64,
        `weight` Int64
    )
    ENGINE = Memory
    
    INSERT INTO test VALUES (1,10,100),(2,10,200),(3,10,300),(1,10,100),(1,20,100),(3,20,300),(4,20,400),(1,30,100),(2,30,200),(3,30,300),(5,30,500)
    
    SELECT
        min_spot AS spot,
        sum(sum(weight)) OVER (ORDER BY spot ASC) AS weight
    FROM
    (
        SELECT
            resp,
            min(spot) AS min_spot,
            argMin(weight, spot) AS weight
        FROM test
        GROUP BY resp
    )
    GROUP BY spot