google-bigquery

BigQuery SQL running totals


Any idea how to calculate a running total in BigQuery SQL?

id   value   running total
--   -----   -------------
1    1       1
2    2       3
3    4       7
4    7       14
5    9       23
6    12      35
7    13      48
8    16      64
9    22      86
10   42      128
11   57      185
12   58      243
13   59      302
14   60      362 

This is not a problem for traditional SQL servers, using either a correlated scalar subquery:

SELECT a.id, a.value, (SELECT SUM(b.value)
                       FROM RunTotalTestData b
                       WHERE b.id <= a.id)
FROM   RunTotalTestData a
ORDER BY a.id;

or join:

SELECT a.id, a.value, SUM(b.value)
FROM   RunTotalTestData a,
       RunTotalTestData b
WHERE b.id <= a.id
GROUP BY a.id, a.value
ORDER BY a.id;

But I couldn't find a way to make it work in BigQuery...


Solution

  • You have probably figured it out already, but here is one way, though not the most efficient one:

    JOIN can only be done using equality comparisons, i.e. b.id <= a.id cannot be used as the join condition.

    https://developers.google.com/bigquery/docs/query-reference#joins

    This is pretty lame if you ask me, but there is a workaround: join with an equality comparison on some dummy value to get the Cartesian product, then use WHERE for the <= condition. This is wildly suboptimal, but if your tables are small it will work.

    SELECT a.id, SUM(b.value) AS rt
    FROM RunTotalTestData a
    JOIN RunTotalTestData b ON a.dummy = b.dummy
    WHERE b.id <= a.id
    GROUP BY a.id
    ORDER BY a.id
    

    You can also constrain the time range manually, which keeps both sides of the join smaller:

    SELECT a.id, SUM(b.value) AS rt
    FROM (
        SELECT id, timestamp, dummy
        FROM RunTotalTestData
        WHERE timestamp >= foo
        AND timestamp < bar
    ) AS a
    JOIN (
        SELECT id, timestamp, value, dummy
        FROM RunTotalTestData
        WHERE timestamp >= foo AND timestamp < bar
    ) b ON a.dummy = b.dummy
    WHERE b.id <= a.id
    GROUP BY a.id
    ORDER BY a.id
    

    Update:

    You don't need a special dummy column in the table. In each subquery you can just use

    SELECT 1 AS one
    

    and join on that.
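
    Put together, the constant-column trick might look roughly like the sketch below. It assumes the RunTotalTestData table from the question with columns id and value; the alias one is just the made-up name from the snippet above, and I have not checked this against every BigQuery SQL dialect.

    ```sql
    -- Sketch: running total via a self-join on a constant column.
    -- Assumes RunTotalTestData(id, value) as in the question.
    SELECT a.id, a.value, SUM(b.value) AS rt
    FROM (
        -- Add a constant column so the join can use an equality condition.
        SELECT id, value, 1 AS one
        FROM RunTotalTestData
    ) AS a
    JOIN (
        SELECT id, value, 1 AS one
        FROM RunTotalTestData
    ) AS b ON a.one = b.one      -- every row of a pairs with every row of b
    WHERE b.id <= a.id           -- keep only the rows up to and including a.id
    GROUP BY a.id, a.value
    ORDER BY a.id
    ```

    With the sample data from the question this should reproduce the running total column, since each a.id ends up summing the values of all rows with a smaller or equal id. The intermediate result still grows with the square of the row count, so the caveat about small tables applies here too.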

    As far as billing goes, the joined table counts toward the data processed.