apache-spark, pyspark, data-processing, market-basket-analysis

Preparing binary-represented data for FPGrowth on Spark


I am currently working on the Santander Product Recommendation dataset from Kaggle to experiment with FPGrowth.

The FPGrowth algorithm from pyspark.ml requires a DataFrame with a column of item sets (see the sketch after the table below):

+---+------------+
| id|       items|
+---+------------+
|  0|   [A, B, E]|
|  1|[A, B, C, E]|
|  2|      [A, B]|
+---+------------+
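For context, a minimal sketch of how such a DataFrame would be consumed; the minSupport value is an arbitrary placeholder, and items_df stands for a DataFrame shaped like the table above:

    from pyspark.ml.fpm import FPGrowth

    # FPGrowth reads an array<string> column; "items" is its default itemsCol
    fp = FPGrowth(itemsCol="items", minSupport=0.5)
    # model = fp.fit(items_df)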

But the data I have is in this format:

+---+---+---+---+---+---+
| id|  A|  B|  C|  D|  E|
+---+---+---+---+---+---+
|  0|  1|  1|  0|  0|  1|
|  1|  1|  1|  1|  0|  1|
|  2|  1|  1|  0|  0|  0|
+---+---+---+---+---+---+

I attempted to solve this by replacing the 1's with the corresponding column names and collecting them into a list, but that did not work.

Is there a way to perform this conversion using Spark DataFrame functions?

Thank you very much!


Solution

  • Use udf:

    from pyspark.sql.functions import udf, struct
    
    # UDF that receives a struct (Row) of the indicator columns and returns
    # the names of the columns whose value is non-zero
    @udf("array<string>")
    def as_basket(row):
        return [k for k, v in row.asDict().items() if v]
    
    # Pack every column except `id` into a struct and convert it to a basket
    df.withColumn("basket", as_basket(struct(*df.columns[1:]))).show()
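  As a quick end-to-end check, here is a minimal sketch that rebuilds the sample input from the question, applies the UDF, and feeds the result to FPGrowth. The minSupport/minConfidence values are arbitrary placeholders, and spark is assumed to be an active SparkSession:

    from pyspark.ml.fpm import FPGrowth
    from pyspark.sql.functions import udf, struct
    
    # Rebuild the sample binary-encoded input from the question
    df = spark.createDataFrame(
        [(0, 1, 1, 0, 0, 1),
         (1, 1, 1, 1, 0, 1),
         (2, 1, 1, 0, 0, 0)],
        ["id", "A", "B", "C", "D", "E"],
    )
    
    @udf("array<string>")
    def as_basket(row):
        return [k for k, v in row.asDict().items() if v]
    
    # Name the column "items" to match FPGrowth's default itemsCol
    baskets = df.withColumn("items", as_basket(struct(*df.columns[1:])))
    
    # minSupport / minConfidence here are arbitrary example values
    fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
    model = fp.fit(baskets)
    model.freqItemsets.show()
    model.associationRules.show()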