I am currently working on the Santander Product Recommendation dataset from Kaggle to run experiments with FPGrowth.
The FPGrowth algorithm from pyspark (ML) requires a DataFrame of item sets:
+---+------------+
| id| items|
+---+------------+
| 0| [A, B, E]|
| 1|[A, B, C, E]|
| 2| [A, B]|
+---+------------+
But the data I have is in this format:
+---+---+---+---+---+---+
| id| A| B| C| D| E|
+---+---+---+---+---+---+
| 0| 1| 1| 0| 0| 1|
| 1| 1| 1| 1| 0| 1|
| 2| 1| 1| 0| 0| 0|
+---+---+---+---+---+---+
I attempted to solve it by replacing the 1's with their column names and building a list from them, but that did not work.
Is there a way to perform this conversion using Spark DataFrame functions?
Thank you very much!
Use a udf:
from pyspark.sql.functions import udf, struct

@udf("array<string>")
def as_basket(row):
    # Keep the names of the columns whose value is truthy (1).
    return [k for k, v in row.asDict().items() if v]

# Pass every item column (everything except "id") into the udf as a struct.
df.withColumn("basket", as_basket(struct(*df.columns[1:]))).show()
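With your sample data this yields baskets [A, B, E], [A, B, C, E] and [A, B]. As a minimal sketch, the result can then be fed straight into pyspark.ml.fpm.FPGrowth (the minSupport/minConfidence values below are just illustrative):

from pyspark.ml.fpm import FPGrowth

# Build the basket column and keep only what FPGrowth needs.
baskets = df.withColumn("basket", as_basket(struct(*df.columns[1:]))).select("id", "basket")

fp = FPGrowth(itemsCol="basket", minSupport=0.3, minConfidence=0.6)
model = fp.fit(baskets)
model.freqItemsets.show()
model.associationRules.show()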