pyspark, pyspark-pandas

How to filter a PySpark dataframe to the last 14 days?


I have a date column in my dataframe.


I want to filter the dataframe down to the last 14 days using the date column. I tried the code below, but it's not working:

  last_14 = df.filter((df('Date')> date_add(current_timestamp(), -14)).select("Event_Time","User_ID","Impressions","Clicks","URL", "Date")

Event_Time, User_ID, Impressions, Clicks, and URL are my other columns.

Can anyone advise how to do this?


Solution

    from pyspark.sql import functions as F

    # Sample data with a string date column, converted to DateType.
    df = spark.createDataFrame(
        [
            ('2022-03-10',),
            ('2022-03-09',),
            ('2022-03-08',),
            ('2022-02-02',),
            ('2022-02-01',),
        ],
        ['Date'],
    ).withColumn('Date', F.to_date('Date', 'y-M-d'))

    # Keep only rows newer than 14 days before today.
    df \
        .filter(F.col('Date') > F.date_sub(F.current_date(), 14)) \
        .show()
    
    +----------+
    |      Date|
    +----------+
    |2022-03-10|
    |2022-03-09|
    |2022-03-08|
    +----------+
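
If you prefer a fixed cutoff (for example, to make the filter reproducible in a test), you can compute the date once on the driver and compare against a literal. A minimal sketch under the same assumptions (a DateType column named Date):

    from datetime import date, timedelta

    from pyspark.sql import functions as F

    # Cutoff computed once on the driver instead of per row on the executors.
    cutoff = date.today() - timedelta(days=14)

    # Cast the ISO string to a date so the comparison stays a date comparison.
    df.filter(F.col('Date') > F.lit(cutoff.isoformat()).cast('date')).show()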
    

    In your code it would be as follows. Note that df('Date') is not valid in PySpark (a DataFrame is not callable; use F.col('Date') or df['Date']), and your original line was also missing a closing parenthesis:

     last_14 = df.filter((F.col('Date') > F.date_sub(F.current_date(), 14))).select("Event_Time","User_ID","Impressions","Clicks","URL", "Date")
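
    The same condition can also be written with a Spark SQL interval expression, which some find more readable. A sketch assuming Spark 3.x, where subtracting a day interval from a date yields a date:

     from pyspark.sql import functions as F

     # Equivalent filter: subtract the interval in SQL instead of calling date_sub.
     last_14 = df.filter(
         F.col('Date') > F.expr('current_date() - interval 14 days')
     ).select("Event_Time", "User_ID", "Impressions", "Clicks", "URL", "Date")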