I'm looking for a way to translate this chunk of SQL into PySpark syntax.
SELECT MEAN(some_value) OVER (
    ORDER BY yyyy_mm_dd
    RANGE BETWEEN INTERVAL 3 MONTHS PRECEDING AND CURRENT ROW
) AS mean
FROM df
If the above were a range expressed in days, this could easily be done with something like
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
(See also ZygD's solution here: Spark Window Functions - rangeBetween dates)
For a range in months, however, this doesn't work, because the number of days in a month is not constant. Any idea how to perform a range over months using PySpark syntax?
You can "borrow" the full SQL column expression and use it in PySpark.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('2022-05-01', 1),
('2022-06-01', 2),
('2022-07-01', 3),
('2022-08-01', 4),
('2022-09-01', 5)],
['yyyy_mm_dd', 'some_value']
).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))
Script:
df = df.withColumn('mean', F.expr("mean(some_value) over (order by yyyy_mm_dd range between interval 3 months preceding and current row)"))
df.show()
# +----------+----------+----+
# |yyyy_mm_dd|some_value|mean|
# +----------+----------+----+
# |2022-05-01| 1| 1.0|
# |2022-06-01| 2| 1.5|
# |2022-07-01| 3| 2.0|
# |2022-08-01| 4| 2.5|
# |2022-09-01| 5| 3.5|
# +----------+----------+----+