hadoop, hive, apache-pig, mahout, mahout-recommender

Convert a matrix data model to a file data model for a Mahout recommender


I have a dataset in the form of a userID-by-itemID ratings matrix that I am trying to convert to {userID, itemID, rating} triples for use with the Mahout item-based recommender, as described here: https://mahout.apache.org/users/recommender/userbased-5-minutes.html#dataset.

In other words, I want to convert something like this:

    1   2   3
1   1.0 2.0 3.0
2   4.0 5.0 6.0
3   7.0 8.0 9.0

Into something like this:

1,1,1.0
1,2,2.0
1,3,3.0
2,1,4.0
2,2,5.0
2,3,6.0
3,1,7.0
3,2,8.0
3,3,9.0

Is there a way to accomplish this using Apache Hadoop tools (Pig, Hive, etc.)?


Solution

  • You can use explode (in Hive):

    If your input table looks like this:

    userID item1 item2 item3
    ------------------------
    1      1.0   2.0   3.0
    2      4.0   5.0   6.0
    3      7.0   8.0   9.0
    

    Then your query can be:

    SELECT userID,
           split(item_val, '_')[0] AS item,
           split(item_val, '_')[1] AS val
    FROM (SELECT userID,
                 array(concat_ws('_', 'item1', item1),
                       concat_ws('_', 'item2', item2),
                       concat_ws('_', 'item3', item3)) AS arr
          FROM in_table) a
    LATERAL VIEW explode(arr) exp AS item_val;
    

    Explanation: The inner query generates this output:

    userID  arr
    ----------------------------------------------
    1       ["item1_1.0","item2_2.0","item3_3.0"]
    2       ["item1_4.0","item2_5.0","item3_6.0"]
    3       ["item1_7.0","item2_8.0","item3_9.0"]
    

    Then after the explode, each row contains the userID together with one item_val string such as item1_1.0; you only need to split it to separate the itemID from the rating value.
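
    For example, split() (a built-in Hive string function; the literal below is just for illustration) returns a two-element array:

    SELECT split('item1_1.0', '_')[0];   -- item1
    SELECT split('item1_1.0', '_')[1];   -- 1.0

    Note that this yields the label item1 rather than the numeric itemID 1 shown in the target format; if you need numeric IDs, use the numbers themselves as labels, e.g. concat_ws('_', '1', item1).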

    Also, if the table's item columns are defined as DOUBLE, you need to CAST(item1 AS STRING) (and likewise item2, item3) before passing them to concat_ws, which only accepts string arguments.
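
    The inner array expression would then become (a minimal sketch, assuming the rating columns are named item1..item3 as above):

    -- cast each DOUBLE rating column to STRING before concat_ws
    array(concat_ws('_', 'item1', CAST(item1 AS STRING)),
          concat_ws('_', 'item2', CAST(item2 AS STRING)),
          concat_ws('_', 'item3', CAST(item3 AS STRING))) AS arr

    Finally, to produce the comma-separated file that Mahout's FileDataModel reads, you can write the result out as delimited text. A sketch, assuming Hive 0.11+ (which added ROW FORMAT on INSERT OVERWRITE DIRECTORY) and a hypothetical output path, using numeric labels so the output matches the target format exactly:

    -- write userID,itemID,rating triples as comma-delimited text
    INSERT OVERWRITE DIRECTORY '/tmp/mahout_ratings'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT userID, split(item_val, '_')[0], split(item_val, '_')[1]
    FROM (SELECT userID,
                 array(concat_ws('_', '1', CAST(item1 AS STRING)),
                       concat_ws('_', '2', CAST(item2 AS STRING)),
                       concat_ws('_', '3', CAST(item3 AS STRING))) AS arr
          FROM in_table) a
    LATERAL VIEW explode(arr) exp AS item_val;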