python, windows, pyspark, sliding

PySpark : Do a simple sliding window on n elements and aggregate by a function


I know this subject has already been posted, but I still don't understand window functions in PySpark. I just want to do the equivalent of this on a PySpark DataFrame: data.rolling(5).agg('sum') -> this is in Pandas.

I want it in PySpark. No need to group by or order by, just slide a window over a column and compute the sum (or my own function).

Example :

df = pd.DataFrame({'A': [1,1,2,2,1,2],
                    'B': [2,2,3,4,2,1]})

print(df)
   A  B
0  1  2
1  1  2
2  2  3
3  2  4
4  1  2
5  2  1

Result :

print(df.rolling(3).agg('sum'))
     A    B
0  NaN  NaN
1  NaN  NaN
2  4.0  7.0
3  5.0  9.0
4  5.0  9.0
5  5.0  7.0

Thanks


Solution

  • You can achieve this by creating a single window over the whole DataFrame and limiting the rows it aggregates with rowsBetween

    from pyspark.sql import Window
    import pyspark.sql.functions as F
    
    
    df1.show()
    +---+---+
    | v1| v2|
    +---+---+
    |  1|  2|
    |  1|  4|
    |  2|  2|
    |  2|  4|
    |  2|  4|
    |  2|  4|
    |  2|  4|
    |  2|  4|
    +---+---+
    
    
    # single partition over the whole DataFrame; frame = current row plus the 2 preceding rows
    # (with no orderBy, rows are aggregated in the order Spark reads them)
    w = Window.partitionBy(F.lit(1)).rowsBetween(-2, 0)
    df1.select(F.sum('v1').over(w).alias('v1'), F.sum('v2').over(w).alias('v2')).show()
    
    +---+---+
    | v1| v2|
    +---+---+
    |  1|  2|
    |  2|  6|
    |  4|  8|
    |  5| 10|
    |  6| 10|
    |  6| 12|
    |  6| 12|
    |  6| 12|
    +---+---+
    

    You can explicitly set the first two rows to null if you want, to match the Pandas output.
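    For example, a minimal sketch of that idea (assuming the same df1, w, and F import as above): count how many rows actually fall inside the frame and blank out the sums while the frame is not yet full.

    # n = how many rows the frame currently covers; when() without otherwise() yields null
    rolling = df1.select(
        F.count(F.lit(1)).over(w).alias('n'),
        F.sum('v1').over(w).alias('v1'),
        F.sum('v2').over(w).alias('v2'))
    rolling.select(
        F.when(F.col('n') >= 3, F.col('v1')).alias('v1'),
        F.when(F.col('n') >= 3, F.col('v2')).alias('v2')).show()

    The question also mentions plugging in your own function instead of sum. That part is not covered by the answer above, but one possible sketch with the same window is to gather the frame's values with collect_list and apply an ordinary Python UDF (note that a plain UDF is much slower than the built-in aggregates):

    from pyspark.sql.types import DoubleType

    # hypothetical custom aggregation: here it just re-implements the sum of the
    # values in the frame, but any Python function of the list would work
    my_agg = F.udf(lambda xs: float(sum(xs)), DoubleType())

    df1.select(
        my_agg(F.collect_list('v1').over(w)).alias('v1'),
        my_agg(F.collect_list('v2').over(w)).alias('v2')).show()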