I am working in Databricks PySpark and I am trying to get the median over the last two months, excluding the current month. There are multiple accounts, so this must be partitioned by account too.
```python
from pyspark.sql.functions import months, median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy("date").rangeBetween(-months(2), months(0))
df = df.withColumn("median", round(median("cost").over(windowSpec), 3))
```
but I get this warning message:

```
/databricks/spark/python/pyspark/sql/functions/builtin.py:19025: FutureWarning: Deprecated in 4.0.0, use partitioning.months instead.
  warnings.warn("Deprecated in 4.0.0, use partitioning.months instead.", FutureWarning)
```
Second attempt:
```python
from pyspark.sql.functions.partitioning import months
from pyspark.sql.functions import median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy("date").rangeBetween(-months(2), months(0))
df = df.withColumn("median", round(median("cost").over(windowSpec), 3))
```
but I get this error message:

```
Argument `col` should be a Column or str, got int.
```
Input:

| account | date | cost |
|---|---|---|
| account1 | 2024-10-01 | 5.00 |
| account1 | 2024-10-02 | 6.00 |
| account1 | 2024-10-03 | 7.00 |
| account1 | 2024-11-01 | 8.00 |
| account1 | 2024-11-02 | 9.00 |
| account1 | 2024-11-03 | 10.00 |
| account1 | 2024-12-01 | 4.88 |
| account1 | 2024-12-02 | 8.46 |
| account1 | 2024-12-03 | 9.43 |
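For reproducibility, the input above can be created like this (a sketch, assuming the built-in `spark` session of a Databricks notebook and a `DateType` date column):

```python
from pyspark.sql.functions import to_date

rows = [
    ("account1", "2024-10-01", 5.00), ("account1", "2024-10-02", 6.00),
    ("account1", "2024-10-03", 7.00), ("account1", "2024-11-01", 8.00),
    ("account1", "2024-11-02", 9.00), ("account1", "2024-11-03", 10.00),
    ("account1", "2024-12-01", 4.88), ("account1", "2024-12-02", 8.46),
    ("account1", "2024-12-03", 9.43),
]
# `spark` is the SparkSession that Databricks notebooks provide by default
df = spark.createDataFrame(rows, ["account", "date", "cost"]).withColumn("date", to_date("date"))
```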
Expected output:

| account | date | cost | median |
|---|---|---|---|
| account1 | 2024-10-01 | 5.00 | null |
| account1 | 2024-10-02 | 6.00 | null |
| account1 | 2024-10-03 | 7.00 | null |
| account1 | 2024-11-01 | 8.00 | null |
| account1 | 2024-11-02 | 9.00 | null |
| account1 | 2024-11-03 | 10.00 | null |
| account1 | 2024-12-01 | 4.88 | 7.5 |
| account1 | 2024-12-02 | 8.46 | 7.5 |
| account1 | 2024-12-03 | 9.43 | 7.5 |
The `months()` function you imported is a partition transform, meant for `DataFrameWriterV2.partitionedBy` when writing partitioned tables, not a window-frame boundary: `rangeBetween` expects plain integer offsets, which is why both attempts fail. You can instead order the window over a numeric month index. To obtain the median of month-2 (the month two months back), like this:
```
+--------+----------+----+------+
| account|      date|cost|median|
+--------+----------+----+------+
|account1|2024-10-01| 5.0|  null|
|account1|2024-10-02| 6.0|  null|
|account1|2024-10-03| 7.0|  null|
|account1|2024-11-01| 8.0|  null|
|account1|2024-11-02| 9.0|  null|
|account1|2024-11-03|10.0|  null|
|account1|2024-12-01|4.88|   6.0|
|account1|2024-12-02|8.46|   6.0|
|account1|2024-12-03|9.43|   6.0|
+--------+----------+----+------+
```
you can use this code:
```python
from pyspark.sql.functions import expr, percentile_approx, round
from pyspark.sql.window import Window

# Compute a month index so that consecutive months differ by exactly 1
df = df.withColumn("year", expr("YEAR(date)")).withColumn("month", expr("MONTH(date)"))
df = df.withColumn("month_offset", expr("year * 12 + month"))

# Define the window spec with a range between -2 and -2 to get exactly month - 2
window_spec = Window.partitionBy("account").orderBy("month_offset").rangeBetween(-2, -2)

# Apply the (approximate) median over the window spec
df = df.withColumn(
    "median",
    round(percentile_approx("cost", 0.5).over(window_spec), 3),
)

df.select("account", "date", "cost", "median").show()
```
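Note that `rangeBetween(-2, -2)` looks only at month-2, which is why December's median is 6.0 (the median of October's 5.0, 6.0, 7.0). To match the 7.5 in your expected output, i.e. the median over both of the two previous months (the six October and November costs), widen the frame to cover month-2 through month-1. A minimal variant, which also assumes a runtime with Spark 3.4+ so the exact `median` function is available instead of `percentile_approx`:

```python
from pyspark.sql.functions import median, round
from pyspark.sql.window import Window

# Cover the two previous months (month-2 through month-1), excluding the current month
window_spec = Window.partitionBy("account").orderBy("month_offset").rangeBetween(-2, -1)

df = df.withColumn("median", round(median("cost").over(window_spec), 3))
```

With the sample data this yields 7.5 for all three December rows.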
I hope this will help you :)