I have a DataFrame and I want to get the first and last values from one of its columns.
+----+-----+--------------------+
|test|count| support|
+----+-----+--------------------+
| A| 5| 0.23809523809523808|
| B| 5| 0.23809523809523808|
| C| 4| 0.19047619047619047|
| G| 2| 0.09523809523809523|
| K| 2| 0.09523809523809523|
| D| 1|0.047619047619047616|
+----+-----+--------------------+
The expected output is the first and last values of the support column, i.e. x=[0.23809523809523808, 0.047619047619047616].
You may use collect, but the performance is going to be terrible: the driver collects all the data just to keep the first and last items. Worse, it will most likely cause an OOM error and thus not work at all if the dataframe is big.
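If your dataframe really is small, a minimal sketch of that approach (collecting only the support column to limit the amount of data moved) could look like this:

# Pulls the whole support column to the driver -- only viable for small dataframes
rows = df.select("support").collect()
x = [rows[0].support, rows[-1].support]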
Another idea would be to use agg with the first and last aggregation functions. This does not work, because the reducers do not necessarily receive the records in the order of the dataframe.
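For reference, a sketch of that tempting but unreliable approach: on an unsorted dataframe, first and last return whatever rows happen to be encountered first and last during execution, which depends on partitioning rather than on the dataframe order.

import pyspark.sql.functions as F
# No ordering guarantee on an unsorted dataframe -- do not rely on this
df.agg(F.first("support"), F.last("support")).show()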
Spark offers a head function, which makes getting the first element very easy. However, Spark does not offer a corresponding last action on dataframes. A straightforward approach is to sort the dataframe in reverse order and use head again.
import pyspark.sql.functions as F

# The first row is easy
first = df.head().support
# For the last row, sort by a descending monotonically increasing id and take the head
last = df.orderBy(F.monotonically_increasing_id().desc()).head().support
Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and keep only the first and last elements.
# Index each row, keep only the first (index 0) and last (index size-1),
# then extract the support value of each
size = df.count()
x = (df.rdd.zipWithIndex()
       .filter(lambda row: row[1] == 0 or row[1] == size - 1)
       .map(lambda row: row[0].support)
       .collect())