Tags: sql, postgresql, time-series, cumulative-sum

Running total (cumulative sum) optimization


I need to compute a running total for the "users" parameter. Here is the (simplified) script:

with us_data as (
   select
      count(us.*) as us_reg,
      count(us.*) filter (where status = 'active') as us_active,
      date_trunc('day', us.created_at) as rp_date
      from public.account_users us
   group by rp_date)
select
   sum(us_reg) over (rows between unbounded preceding and current row)::int as total_users,
   rp_date
from us_data
order by rp_date;

Since the table grows each day, the script runs slowly because it must count all users from the beginning. I could store the totals up to the day before (or any other previous day) in a different table and change the script to compute total_users only from that day onward, add the saved results, and store the new totals back in the result table.

Is there a more optimal way to speed up the calculation of the running total of user accounts?


Solution

  • You can just run a window function over an aggregate function in the same SELECT.

    us_active in the CTE was dead freight that I cut out.

    And rows between unbounded preceding and current row matches the default frame (strictly, the default is range between unbounded preceding and current row, which is equivalent here since rp_date is unique after aggregation), so you don't have to spell it out.

    But you did omit ORDER BY in the window function. I added it:

    SELECT sum(count(*)::int) OVER (ORDER BY date_trunc('day', us.created_at)) AS total_users
         , date_trunc('day', us.created_at) AS rp_date
    FROM   public.account_users us
    GROUP  BY rp_date
    ORDER  BY rp_date;  -- !
    

    Storing the running sum in some kind of materialized view, and only adding the last increment for "yesterday", will obviously make it a lot faster for a long series. It adds quite a bit of overhead, though, so the series has to go on for a while before the added cost pays off.
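    A minimal sketch of that idea, assuming a plain summary table filled once a day by a scheduled job (the table name daily_user_totals and the scheduling are hypothetical, not from the question):

    -- Hypothetical summary table: one finished row per day.
    CREATE TABLE daily_user_totals (
       rp_date     date PRIMARY KEY
     , us_reg      int  NOT NULL  -- registrations on that day
     , total_users int  NOT NULL  -- running total up to and including rp_date
    );

    -- Daily job: append yesterday's row, building on the last stored
    -- running total instead of rescanning the whole table.
    INSERT INTO daily_user_totals (rp_date, us_reg, total_users)
    SELECT d.rp_date
         , d.us_reg
         , COALESCE((SELECT t.total_users
                     FROM   daily_user_totals t
                     ORDER  BY t.rp_date DESC
                     LIMIT  1), 0) + d.us_reg
    FROM  (
       SELECT date_trunc('day', us.created_at)::date AS rp_date
            , count(*)::int                          AS us_reg
       FROM   public.account_users us
       WHERE  us.created_at >= CURRENT_DATE - 1
       AND    us.created_at <  CURRENT_DATE
       GROUP  BY 1
       ) d;

    Reading the series is then a cheap scan of daily_user_totals, optionally topped up with a live count for the current day. Note the sketch skips days with zero registrations; a real job would carry the previous total forward for those.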


    For huge amounts of time-series data, consider a specialized extension like TimescaleDB.
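
    For illustration only, a rough sketch of the TimescaleDB route, assuming the extension is installed (the view name daily_regs is hypothetical; converting a live table may need more care, e.g. the time column must be NOT NULL):

    -- Convert the table to a hypertable partitioned by created_at;
    -- migrate_data => true moves existing rows into chunks.
    SELECT create_hypertable('public.account_users', 'created_at',
                             migrate_data => true);

    -- Continuous aggregate maintaining the daily counts incrementally.
    CREATE MATERIALIZED VIEW daily_regs
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('1 day', created_at) AS rp_date
         , count(*) AS us_reg
    FROM   public.account_users
    GROUP  BY time_bucket('1 day', created_at);

    The running total itself still needs a window function (continuous aggregates don't allow them), but it then runs over one row per day in daily_regs instead of one row per user.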