Tags: date, pyspark, databricks, window-functions, median

Median calculation over windows - rangeBetween over months in pyspark databricks


I am working in Databricks PySpark and I am trying to get the median over the last two months, excluding the current month. Also, there are multiple accounts, so this must be partitioned by account too.

from pyspark.sql.functions import months, median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy("date").rangeBetween(-months(2), months(0))
df = df.withColumn("median", round(median("cost").over(windowSpec), 3))

but I get this warning message:

/databricks/spark/python/pyspark/sql/functions/builtin.py:19025: FutureWarning: Deprecated in 4.0.0, use partitioning.months instead. warnings.warn("Deprecated in 4.0.0, use partitioning.months instead.", FutureWarning)

Second attempt:

from pyspark.sql.functions.partitioning import months
from pyspark.sql.functions import median, round
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("account").orderBy("date").rangeBetween(-months(2), months(0))
df = df.withColumn("median", round(median("cost").over(windowSpec), 3))

but I get this error message:

Argument col should be a Column or str, got int.

input:

account date cost
account1 2024-10-01 5.00
account1 2024-10-02 6.00
account1 2024-10-03 7.00
account1 2024-11-01 8.00
account1 2024-11-02 9.00
account1 2024-11-03 10.00
account1 2024-12-01 4.88
account1 2024-12-02 8.46
account1 2024-12-03 9.43

expected output:

account date cost median
account1 2024-10-01 5.00 null
account1 2024-10-02 6.00 null
account1 2024-10-03 7.00 null
account1 2024-11-01 8.00 null
account1 2024-11-02 9.00 null
account1 2024-11-03 10.00 null
account1 2024-12-01 4.88 7.5
account1 2024-12-02 8.46 7.5
account1 2024-12-03 9.43 7.5

Solution

  • To obtain the median of month − 2 (i.e., October's median shown on December's rows), like this:

    +--------+----------+----+------+
    | account|      date|cost|median|
    +--------+----------+----+------+
    |account1|2024-10-01| 5.0|  null|
    |account1|2024-10-02| 6.0|  null|
    |account1|2024-10-03| 7.0|  null|
    |account1|2024-11-01| 8.0|  null|
    |account1|2024-11-02| 9.0|  null|
    |account1|2024-11-03|10.0|  null|
    |account1|2024-12-01|4.88|   6.0|
    |account1|2024-12-02|8.46|   6.0|
    |account1|2024-12-03|9.43|   6.0|
    +--------+----------+----+------+
    

    you can use this code:

    from pyspark.sql.functions import expr, round, percentile_approx
    from pyspark.sql.window import Window


    # Compute a month offset so the RANGE frame counts calendar months
    df = df.withColumn("month_offset", expr("YEAR(date) * 12 + MONTH(date)"))

    # Define the window spec with a range between -2 and -2 to get only month - 2
    window_spec = Window.partitionBy("account").orderBy("month_offset").rangeBetween(-2, -2)


    # Apply the approximate median over the window spec
    df = df.withColumn(
        "median",
        round(percentile_approx("cost", 0.5).over(window_spec), 3),
    )

    df.select("account", "date", "cost", "median").show()
    

    I hope this will help you :)