python, dataframe, apache-spark, pyspark

Select rows from a PySpark DataFrame based on the latest date value


I have a table like the one shown below. Since the order numbers reoccur with different dates, I would like to read just one row per order, the one with the latest date. For example, for order A1 I want only the row dated 24/03/2022. How can I do this in PySpark? Thanks.

This is my data table:


Solution

  • from pyspark.sql import Window
    from pyspark.sql import functions as F

    # rank each order's rows by date, newest first
    w = Window.partitionBy('order').orderBy(F.col('date').desc())

    # keep only the top-ranked (latest) row per order, then drop the helper column
    df = df.withColumn('rank', F.row_number().over(w))
    df = df.filter(df['rank'] == 1).drop('rank')
    

    I solved this by ranking each order's rows by date in descending order and selecting the row with rank 1, i.e. the latest date.
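
    As a minimal, self-contained sketch of the same window-ranking approach (the `order`/`date` column names and the sample values below are illustrative assumptions based on the question):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('latest-per-order').getOrCreate()

    # toy data; column names assumed from the question
    df = spark.createDataFrame(
        [('A1', '2022-03-20'), ('A1', '2022-03-24'), ('B2', '2022-03-22')],
        ['order', 'date'],
    )
    # cast the date strings to a real date type so the ordering is chronological
    df = df.withColumn('date', F.to_date('date'))

    # rank rows within each order, newest date first, and keep only rank 1
    w = Window.partitionBy('order').orderBy(F.col('date').desc())
    latest = df.withColumn('rank', F.row_number().over(w)).filter('rank = 1').drop('rank')

    latest.show()
    # expected (row order may vary):
    # +-----+----------+
    # |order|      date|
    # +-----+----------+
    # |   A1|2022-03-24|
    # |   B2|2022-03-22|
    # +-----+----------+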