sqlpysparkapache-spark-sql

Cumulative sum in a dataframe grouped by year-month


I have a table (call it release_dates) containing dates which looks like this:

+------------+
|release_date|
+------------+
|  1997-06-30|
|  1997-11-14|
|  1998-11-08|
|  1999-04-01|
|  1999-09-08|
|  1999-11-01|
|  2000-11-01|
|  2000-11-01|
|  2001-03-15|
|  2001-06-01|
|  2001-06-01|
+------------+

I want to get the number of releases on a monthly basis, and also the cumulative releases with the same monthly granularity. So with the above table, what I want is:

+----------+-----------+------------+
|      date|nb_releases|cum_releases|
+----------+-----------+------------+
|1997-06-01|          1|           1|
|1997-11-01|          1|           2|
|1998-11-01|          1|           3|
|1999-04-01|          1|           4|
|1999-09-01|          1|           5|
|1999-11-01|          1|           6|
|2000-11-01|          2|           8|
|2001-03-01|          1|           9|
|2001-06-01|          2|          11|
+----------+-----------+------------+

I've tried something like

SELECT
  any_value(release_date) - DAY(any_value(release_date)) + 1 AS date,
  COUNT (*) AS nb_releases,
  SUM (COUNT(*)) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_releases
FROM 
  release_dates
WHERE
  release_date IS NOT NULL
GROUP BY
  YEAR (release_date),
  MONTH (release_date)
ORDER BY
  YEAR (release_date),
  MONTH (release_date)

But I get this error:

A column or function parameter with name release_date cannot be resolved

I'm using pyspark version 3.5.3, and am willing to restrain myself to SQL queries (that is, not use pyspark API).


Solution

  • I think you're over complicating this. You are trying to group by and order by objects that are not in your query.

    spark.sql("""
        SELECT
          TRUNC(release_date, "month") as date,
          COUNT (*) AS nb_releases,
          SUM (COUNT (*)) OVER (
            ORDER BY TRUNC(release_date, "month")
            ROWS BETWEEN UNBOUNDED PRECEDING 
            AND CURRENT ROW
          ) AS cum_releases 
        FROM 
          release_dates 
        WHERE 
          release_date IS NOT NULL 
        GROUP BY 
          TRUNC(release_date, "month")
    """).show()