Here I am trying to filter 30 distinct product ids based on the score and rank of each combination of country and gender (US:W; US:M; CA:W; CA:M). Using the code, I am getting the right output, but the only issue is that product IDs are displayed in case they are repeating in the entire dataset of each combination.
for e.g.
product_id_1: Score: 10: rank 1
product_id_1 : Score: 8; rank 3
product_id_1 : Score: 0.1; rank 40
In this example, I should display only product_id_1 and product_id_1 since I am fetching only 30 distinct product ids, but I am getting product_id_1 rank 40 as well.
[![enter code here][1]][1]
I have tried the below approach:
filtered_df = (
df
.groupBy("country", "gender", "product_id")
.agg({"score": "max", "rank": "min"})
.orderBy("country", "gender", desc("max(score)"), "min(rank)")
.groupBy("country", "gender")
.agg(collect_set("product_id").alias("product_ids"))
.select("country", "gender", expr("slice(product_ids, 1, 30)").alias("filtered_product_ids"))
)
filtered_df.show()
Results:
+-------+------+--------------------+
|country|gender|filtered_product_ids|
+-------+------+--------------------+
| US| M|[product_id_2, pr...|
| US| W|[product_id_2, pr...|
| CA| M|[product_id_2, pr...|
| CA| W|[product_id_2, pr...|
+-------+------+--------------------+
In the above code Grouping the Df by "country", "gender", and "product_id". Aggregating the maximum score and minimum rank for each group. Ordering the results by "country", "gender", descending maximum score, and minimum rank. Group the results again by "country" and "gender". For each combination of country and gender, group the product IDs into a list and then select the first 30 unique product IDs to get the top 30 products for that group.