I have a dataframe with anywhere from 2-5 fruit columns and a single column for the count. A fruit will never repeat within a row, but the same combination of fruits can appear on multiple rows. The count is always the same for these duplicates, since it is the count for the combination of fruits as a whole.
I want to keep just one row per unique combination of fruits and remove the other duplicate rows. Among the duplicates, it does not matter which row is retained and which is removed. If it makes it easier, we can sort in ascending order starting with first_fruit, then second_fruit, then third_fruit.
Lastly, I only want to keep the top X rows in terms of count.
The example below has 3 fruits.
Current Dataframe
df = pd.DataFrame({'first_fruit': ['apple', 'orange', 'banana', 'cherry'],
                   'second_fruit': ['orange', 'apple', 'cherry', 'apple'],
                   'third_fruit': ['cherry', 'cherry', 'apple', 'orange'],
                   'count': ['10', '10', '5', '3']})
Desired Dataframe Output
desired_output = pd.DataFrame({'first_fruit': ['apple', 'banana'],
                               'second_fruit': ['orange', 'cherry'],
                               'third_fruit': ['cherry', 'apple'],
                               'count': ['10', '5']})
Please let me know if any clarification is needed. I appreciate the support!
You can use the row_number() window function, with the partition created from the sorted combination of the fruit columns. Then use limit() with orderBy() to keep the top X rows you need.
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df.withColumn(
    "group_rank",
    func.row_number().over(
        Window.partitionBy(
            # Sort the fruits so every ordering of the same combination
            # falls into the same partition.
            func.array_sort(func.array("first_fruit", "second_fruit", "third_fruit"))
        ).orderBy(
            # Cast to int: "count" holds strings, and string ordering
            # would rank "5" above "10".
            func.col("count").cast("int").desc()
        )
    )
).filter(
    func.col("group_rank") == 1
).drop(
    "group_rank"
).orderBy(
    func.col("count").cast("int").desc()
).limit(
    2
)
Note that when you use the row_number() window function, you must give it a column to order by, even though in your case the count value is the same within each group.
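Since your example dataframe is in pandas, here is an equivalent sketch in plain pandas for comparison (the names fruit_cols, key, and top_x are just illustrative; it also assumes the string counts should be compared numerically):

```python
import pandas as pd

df = pd.DataFrame({'first_fruit': ['apple', 'orange', 'banana', 'cherry'],
                   'second_fruit': ['orange', 'apple', 'cherry', 'apple'],
                   'third_fruit': ['cherry', 'cherry', 'apple', 'orange'],
                   'count': ['10', '10', '5', '3']})

fruit_cols = ['first_fruit', 'second_fruit', 'third_fruit']

# Build an order-insensitive key from the fruit columns, then drop
# rows whose combination has already been seen.
key = df[fruit_cols].apply(lambda row: tuple(sorted(row)), axis=1)
deduped = df.loc[~key.duplicated()].copy()

# Cast count to numeric so the top-X selection sorts numerically,
# then keep the top 2 rows by count.
deduped['count'] = pd.to_numeric(deduped['count'])
top_x = deduped.nlargest(2, 'count').reset_index(drop=True)
print(top_x)
```

Swap the 2 in nlargest for whatever X you need; which duplicate survives is simply the first occurrence in the original row order.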