[SOLVED] Cumulative sum in a dataframe grouped by year-month

Cumulative sum in a dataframe grouped by year-month

I have a table (call it release_dates) containing dates which looks like this:

+------------+
|release_date|
+------------+
|  1997-06-30|
|  1997-11-14|
|  1998-11-08|
|  1999-04-01|
|  1999-09-08|
|  1999-11-01|
|  2000-11-01|
|  2000-11-01|
|  2001-03-15|
|  2001-06-01|
|  2001-06-01|
+------------+

I want to get the number of releases on a monthly basis, and also the cumulative releases with the same monthly granularity. So with the above table, what I want is:

+----------+-----------+------------+
|      date|nb_releases|cum_releases|
+----------+-----------+------------+
|1997-06-01|          1|           1|
|1997-11-01|          1|           2|
|1998-11-01|          1|           3|
|1999-04-01|          1|           4|
|1999-09-01|          1|           5|
|1999-11-01|          1|           6|
|2000-11-01|          2|           8|
|2001-03-01|          1|           9|
|2001-06-01|          2|          11|
+----------+-----------+------------+

I've tried something like

SELECT
  any_value(release_date) - DAY(any_value(release_date)) + 1 AS date,
  COUNT (*) AS nb_releases,
  SUM (COUNT(*)) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_releases
FROM 
  release_dates
WHERE
  release_date IS NOT NULL
GROUP BY
  YEAR (release_date),
  MONTH (release_date)
ORDER BY
  YEAR (release_date),
  MONTH (release_date)

But I get this error:

A column or function parameter with name release_date cannot be resolved

I'm using pyspark version 3.5.3, and am willing to restrain myself to SQL queries (that is, not use pyspark API).

Solution

I think you're over complicating this. You are trying to group by and order by objects that are not in your query.

spark.sql("""
    SELECT
      TRUNC(release_date, "month") as date,
      COUNT (*) AS nb_releases,
      SUM (COUNT (*)) OVER (
        ORDER BY TRUNC(release_date, "month")
        ROWS BETWEEN UNBOUNDED PRECEDING 
        AND CURRENT ROW
      ) AS cum_releases 
    FROM 
      release_dates 
    WHERE 
      release_date IS NOT NULL 
    GROUP BY 
      TRUNC(release_date, "month")
""").show()