apache-spark mahout recommendation-engine mahout-recommender

Spark Item Similarity Interpretation (Cross-Similarity and Similarity)


I've been using Spark Item Similarity through mahout by following the steps in this article:

https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html

I was able to clean my data, set up a local-only Spark/Hadoop node, and all that.

Now, my question is more about the interpretation of the matrices. I've tried some Google queries with limited success.

I'm creating a multi-modal recommender - and one of my datasets is very similar to the Mahout example.

Example input:

Customer   ActionName   Product
11064612   view         241505
11086047   purchase     110915
11121878   view         CERT_DL
11149030   purchase     CERT_FS
11104130   view         111401

The output of Mahout is 2 sets of matrices: a similarity matrix and a cross-similarity matrix.

This is my similarity matrix (I assume mahout uses my "filter1" purchases)

**791207-WP**   791520-WP:11.350536461453885 791520:9.547158147208393 76130142:7.938639976084232 711215:7.0641921646893024 751309:6.805891904514283

So how would I interpret this? If someone purchased 791207-WP they could be interested in 791520-WP? (so I'd use the left part against purchases of a customer and rank products in the right part?).

The row for 791520-WP looks like this:

791520-WP   76151220:18.954662238247693 791604-WP:13.951210170984268

So, in theory, I'd recommend 76151220 to someone who bought 791520-WP, correct?

Part 2 of the question is interpreting the cross-similarity matrix. Remember my filter2 is "views".

How would I interpret this:

**790907**  76120956:14.2824428207241 791500-LXQ2:13.864741460885853 190907:10.735807818360627

I take this matrix as "someone who visited the 76120956 web page ended up purchasing 790907". So I should promote 790907 to customers who bought 76120956 and maybe even add a link between these 2 products on our site, for example.

Or is it "people who visited the webpage of 790907 ended up buying 76120956"?

My plan is not to use these as-is. I'll still use RowSimilarity and different sources to rank products - but I'm missing the basic interpretation of the outputs from mahout.

If you know of any documentation that clarifies this, that would be a great asset to have.

Thank you.


Solution

  • In both cases the matrix tells you that the item-id key is similar to the listed items, with the strength of each relationship given by the LLR (log-likelihood ratio) value attached to that item. "Similar" here means that similar users purchased the items. In the second case it says that similar people viewed the items, and that the view also appears to have led to a purchase of the same item.

    Cooccurrence works for purchases alone; cross-occurrence adds a check to make sure the view also correlated with a purchase. This allows you to use both for recommendations.

    The output is generally meant to be used with a search engine: you would use a user's history of purchases and views as a two-field query against the matrices, one in each field.

    There are analogous methods to find item-based recommendations.

    Better yet, use something like the Universal Recommender here: actionml.com/docs/ur with PredictionIO for an end-to-end system.
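The two-field query idea can be sketched in plain Python instead of a real search engine. This is only an illustration under stated assumptions: each output row is taken to have the form `itemID<whitespace>otherItem:score ...` as in the rows quoted in the question, the row key is treated as the candidate to recommend, the listed items are treated as indicators matched against the user's history, and the LLR strengths are simply summed as a stand-in for whatever ranking a search engine would apply. The names `parse_row` and `recommend` and the sample user history are hypothetical.

```python
from collections import defaultdict


def parse_row(line):
    """Parse 'itemID otherItem:score otherItem:score ...' into (itemID, {otherItem: score})."""
    key, *pairs = line.split()
    indicators = {}
    for pair in pairs:
        # Split on the last ':' so item ids containing ':' would still parse.
        other, score = pair.rsplit(":", 1)
        indicators[other] = float(score)
    return key, indicators


def recommend(purchases, views, similarity, cross_similarity, top_n=5):
    """Score each row key by matching the user's history against its indicators.

    Assumption: summed LLR strength is used as the score; a search engine
    would apply its own relevance ranking instead.
    """
    scores = defaultdict(float)
    fields = ((similarity, purchases), (cross_similarity, views))
    for matrix, history in fields:
        for item, indicators in matrix.items():
            for seen in history:
                if seen in indicators:
                    scores[item] += indicators[seen]
    # Do not recommend items the user already purchased.
    ranked = sorted(
        ((item, score) for item, score in scores.items() if item not in purchases),
        key=lambda pair: -pair[1],
    )
    return ranked[:top_n]


# Rows copied from the question (purchase similarity, then view cross-similarity).
similarity_rows = [
    "791207-WP 791520-WP:11.350536461453885 791520:9.547158147208393 "
    "76130142:7.938639976084232 711215:7.0641921646893024 751309:6.805891904514283",
    "791520-WP 76151220:18.954662238247693 791604-WP:13.951210170984268",
]
cross_similarity_rows = [
    "790907 76120956:14.2824428207241 791500-LXQ2:13.864741460885853 "
    "190907:10.735807818360627",
]

similarity = dict(parse_row(row) for row in similarity_rows)
cross_similarity = dict(parse_row(row) for row in cross_similarity_rows)

# Hypothetical user: bought 791520-WP and viewed the page for 76120956.
recs = recommend({"791520-WP"}, {"76120956"}, similarity, cross_similarity)
for item, score in recs:
    print(item, score)
```

Under these assumptions the row key plays the role of the indexed document and the listed items play the role of its indicator field, so a view of 76120956 surfaces 790907 and a purchase of 791520-WP surfaces 791207-WP.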