Tags: date, pyspark, databricks, window-functions, median

Median calculation over windows - rangeBetween over months in pyspark databricks


I am working in Databricks PySpark and I am trying to get the median over the last two months, excluding the current month. Also, there are multiple accounts, so this must be partitioned by account too.

from pyspark.sql.functions import months, median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy("date").rangeBetween(-months(2), months(0))
df = df.withColumn("median", round(median("cost").over(windowSpec), 3))

but I get this warning message:

/databricks/spark/python/pyspark/sql/functions/builtin.py:19025: FutureWarning: Deprecated in 4.0.0, use partitioning.months instead. warnings.warn("Deprecated in 4.0.0, use partitioning.months instead.", FutureWarning)

Second attempt:

from pyspark.sql.functions.partitioning import months
from pyspark.sql.functions import median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy("date").rangeBetween(-months(2), months(0))
df = df.withColumn("median", round(median("cost").over(windowSpec), 3))

but I get this error message:

Argument col should be a Column or str, got int.

input:

account date cost
account1 2024-10-01 5.00
account1 2024-10-02 6.00
account1 2024-10-03 7.00
account1 2024-11-01 8.00
account1 2024-11-02 9.00
account1 2024-11-03 10.00
account1 2024-12-01 4.88
account1 2024-12-02 8.46
account1 2024-12-03 9.43

expected output:

account date cost median
account1 2024-10-01 5.00 null
account1 2024-10-02 6.00 null
account1 2024-10-03 7.00 null
account1 2024-11-01 8.00 null
account1 2024-11-02 9.00 null
account1 2024-11-03 10.00 null
account1 2024-12-01 4.88 7.5
account1 2024-12-02 8.46 7.5
account1 2024-12-03 9.43 7.5

Solution

  • To obtain the median of month − 2 (i.e., October's median shown on December's rows), like this:

    +--------+----------+----+------+
    | account|      date|cost|median|
    +--------+----------+----+------+
    |account1|2024-10-01| 5.0|  null|
    |account1|2024-10-02| 6.0|  null|
    |account1|2024-10-03| 7.0|  null|
    |account1|2024-11-01| 8.0|  null|
    |account1|2024-11-02| 9.0|  null|
    |account1|2024-11-03|10.0|  null|
    |account1|2024-12-01|4.88|   6.0|
    |account1|2024-12-02|8.46|   6.0|
    |account1|2024-12-03|9.43|   6.0|
    +--------+----------+----+------+
    

    you can use this code:

    from pyspark.sql.functions import expr, round, percentile_approx
    from pyspark.sql.window import Window


    # Compute a month offset so the RANGE frame counts calendar months
    df = df.withColumn("month_offset", expr("YEAR(date) * 12 + MONTH(date)"))

    # Define the window spec with a range between -2 and -2 to get only month - 2
    window_spec = Window.partitionBy("account").orderBy("month_offset").rangeBetween(-2, -2)


    # Apply the approximate median over the window spec
    df = df.withColumn(
        "median",
        round(percentile_approx("cost", 0.5).over(window_spec), 3),
    )

    df.select("account", "date", "cost", "median").show()
    

    I hope this will help you :)