pandas dataframe pyspark rolling-sum

PySpark equivalent of pandas' rolling over a given time interval


Is there an equivalent of this pandas functionality in PySpark?

pandasDataFrame.rolling('2s', min_periods=1).sum()

where the data in question has timestamps like this:

2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:05  3.0
:

(documentation here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html )



Solution

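  • Use Spark's window function, which assigns each row to a 2-second time bucket based on its timestamp column (here "tmst").

    from pyspark.sql import functions as F

    # Add a struct column holding the start and end of each row's 2-second bucket
    df = df.withColumn(
        "window",
        F.window("tmst", "2 seconds")
    )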