I have a table (call it release_dates) containing dates, which looks like this:
+------------+
|release_date|
+------------+
| 1997-06-30|
| 1997-11-14|
| 1998-11-08|
| 1999-04-01|
| 1999-09-08|
| 1999-11-01|
| 2000-11-01|
| 2000-11-01|
| 2001-03-15|
| 2001-06-01|
| 2001-06-01|
+------------+
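In case it helps, here's how I build this table in my session (a minimal sketch, assuming spark is the active SparkSession, as in a default pyspark shell):

import datetime

rows = [(datetime.date.fromisoformat(d),) for d in [
    "1997-06-30", "1997-11-14", "1998-11-08", "1999-04-01",
    "1999-09-08", "1999-11-01", "2000-11-01", "2000-11-01",
    "2001-03-15", "2001-06-01", "2001-06-01",
]]

# One DATE column named release_date, exposed as a temp view for spark.sql
spark.createDataFrame(rows, "release_date DATE").createOrReplaceTempView("release_dates")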
I want to get the number of releases on a monthly basis, and also the cumulative releases with the same monthly granularity. So with the above table, what I want is:
+----------+-----------+------------+
| date|nb_releases|cum_releases|
+----------+-----------+------------+
|1997-06-01| 1| 1|
|1997-11-01| 1| 2|
|1998-11-01| 1| 3|
|1999-04-01| 1| 4|
|1999-09-01| 1| 5|
|1999-11-01| 1| 6|
|2000-11-01| 2| 8|
|2001-03-01| 1| 9|
|2001-06-01| 2| 11|
+----------+-----------+------------+
I've tried something like this:

SELECT
    any_value(release_date) - DAY(any_value(release_date)) + 1 AS date,
    COUNT(*) AS nb_releases,
    SUM(COUNT(*)) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_releases
FROM
    release_dates
WHERE
    release_date IS NOT NULL
GROUP BY
    YEAR(release_date),
    MONTH(release_date)
ORDER BY
    YEAR(release_date),
    MONTH(release_date)
But I get this error:

A column or function parameter with name release_date cannot be resolved
I'm using PySpark version 3.5.3, and I'm happy to restrict myself to SQL queries (that is, not use the PySpark DataFrame API).
I think you're overcomplicating this. You're grouping and ordering by YEAR(release_date) and MONTH(release_date), but release_date itself is no longer part of the query's output once the aggregation runs, which is why the ORDER BY can't resolve it. Group on the month-truncated date directly and reuse that expression everywhere:
spark.sql("""
    SELECT
        TRUNC(release_date, 'month') AS date,
        COUNT(*) AS nb_releases,
        SUM(COUNT(*)) OVER (
            ORDER BY TRUNC(release_date, 'month')
            ROWS BETWEEN UNBOUNDED PRECEDING
                AND CURRENT ROW
        ) AS cum_releases
    FROM
        release_dates
    WHERE
        release_date IS NOT NULL
    GROUP BY
        TRUNC(release_date, 'month')
    ORDER BY
        date
""").show()