apache-spark, pyspark, apache-spark-sql

How to get the first value and last value from a DataFrame column in PySpark?


I have a DataFrame, and I want to get the first value and the last value from a DataFrame column.

+----+-----+--------------------+
|test|count|             support|
+----+-----+--------------------+
|   A|    5| 0.23809523809523808|
|   B|    5| 0.23809523809523808|
|   C|    4| 0.19047619047619047|
|   G|    2| 0.09523809523809523|
|   K|    2| 0.09523809523809523|
|   D|    1|0.047619047619047616|
+----+-----+--------------------+

The expected output is the first and last value of the support column, i.e. x = [0.23809523809523808, 0.047619047619047616].


Solution

  • You could use collect, but the performance is going to be terrible, since the driver collects all the data just to keep the first and last items. Worse than that, it will most likely cause an OOM error and thus not work at all if you have a big dataframe.
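
    For completeness, a minimal sketch of that approach (the variable name values is illustrative, and it is only viable when the dataframe is small enough to fit on the driver):

    # Pulls every row to the driver -- avoid on large dataframes
    values = [row.support for row in df.collect()]
    x = [values[0], values[-1]]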

    Another idea would be to use agg with the first and last aggregation functions. This does not work, because the reducers do not necessarily receive the records in the order of the dataframe.
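
    For reference, a sketch of that non-working attempt is below; the values it returns are not guaranteed to come from the first and last rows:

    import pyspark.sql.functions as F

    # Unreliable: F.first/F.last do not respect the dataframe's row order
    df.agg(F.first("support"), F.last("support")).collect()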

    Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer an equivalent last function. A straightforward approach is therefore to sort the dataframe backward and use the head function again.

    import pyspark.sql.functions as F

    # First element: head() returns the first Row of the dataframe
    first = df.head().support
    # Last element: order by a descending monotonically increasing id, then take the head
    last = df.orderBy(F.monotonically_increasing_id().desc()).head().support
    

    Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and only keep the first and the last elements.

    size = df.count()
    # zipWithIndex pairs each Row with its index; keep only the first (0) and last (size-1)
    df.rdd.zipWithIndex()\
      .filter(lambda x: x[1] == 0 or x[1] == size - 1)\
      .map(lambda x: x[0].support)\
      .collect()
    # -> [0.23809523809523808, 0.047619047619047616]