apache-spark, pyspark, associations, fpgrowth

PySpark + association rule mining: how to transform a data frame into a format suitable for frequent pattern mining?


I am trying to use PySpark for association rule mining. Let's say my data looks like this:

myItems=spark.createDataFrame([(1,'a'),
                               (1,'b'),
                               (1,'d'),
                               (1,'c'),
                               (2,'a'),
                               (2,'c'),],
                              ['id','item']) 

But according to https://spark.apache.org/docs/2.2.0/ml-frequent-pattern-mining.html, the format should be:

df = spark.createDataFrame([(1, ['a', 'b', 'd','c']),
                            (2, ['a', 'c'])], 
                           ["id", "items"])

So I need to convert my data from this long, one-row-per-item format into one list of items per id, and the number of items per id varies.

How can I do this transformation, or is there another way to achieve it?


Solution

  • Your original definition of myItems can stay as it is. collect_list does the job: group the DataFrame by id and aggregate the items into a list.

    >>> myItems=spark.createDataFrame([(1,'a'),
    ...                                (1,'b'),
    ...                                (1,'d'),
    ...                                (1,'c'),
    ...                                (2,'a'),
    ...                                (2,'c'),],
    ...                               ['id','item'])
    >>> from pyspark.sql.functions import collect_list
    >>> myItems.groupBy(myItems.id).agg(collect_list('item')).show()
    +---+------------------+
    | id|collect_list(item)|
    +---+------------------+
    |  1|      [a, b, d, c]|
    |  2|            [a, c]|
    +---+------------------+
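
  • If the goal is to feed this into FPGrowth, it helps to alias the aggregated column to items so it matches the docs you linked. Below is a minimal sketch, assuming Spark 2.2+ with pyspark.ml.fpm available; the minSupport and minConfidence values are arbitrary placeholders, not recommendations:

    >>> from pyspark.ml.fpm import FPGrowth
    >>> from pyspark.sql.functions import collect_list
    >>> # one row per id, with all of its items collected into an array column named 'items'
    >>> baskets = myItems.groupBy('id').agg(collect_list('item').alias('items'))
    >>> # minSupport/minConfidence are placeholder values -- tune them for your data
    >>> fp = FPGrowth(itemsCol='items', minSupport=0.5, minConfidence=0.6)
    >>> model = fp.fit(baskets)
    >>> model.freqItemsets.show()
    >>> model.associationRules.show()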