apache-sparkapache-spark-mllibmahout-recommender

How to use secondary user actions with to improve recommendations with Spark ALS?


Is there a way to use secondary user actions derived from the user click stream to improve recommendations when using Spark Mllib ALS?

I have gone through the explicit and implicit feedback based example mentioned here : https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html that uses the same ratings RDD for the train() and trainImplicit() methods.

Does this mean I need to call trainImplicit() on the same model object with a RDD(user,item,action) for each secondary user action? Or train multiple models , retrieve recommendations based on each action and then combine them linearly?

For additional context, the crux of the question is if Spark ALS can model secondary actions like Mahout's spark item similarity job. Any pointers would help.


Solution

  • Disclaimer: I work with Mahout's Spark Item Similarity.

    ALS does not work well for multiple actions in general. First an illustration. The way we consume multiple actions in ALS is to weight one above the other. For instance buy = 5, view = 3. ALS was designed in the days when ratings seemed important and predicting them was the question. We now know that ranking is more important. In any case ALS uses predicted ratings/weights to rank results. This means that a view is really telling ALS nothing since a rating of 3 means what? Like? Dislike? ALS tries to get around this by adding a regularization parameter and this will help in deciding if 3 is a like or not.

    But the problem is more fundamental than that, it is one of user intent. When a user views a product (using the above ecom type example) how much "buy" intent is involved? From my own experience there may be none or there may be a lot. The product was new, or had a flashy image or other clickbait. Or I'm shopping and look at 10 things before buying. I once tested this with a large ecom dataset and found no combination of regularization parameter (used with ALS trainImplicit) and action weights that would beat the offline precision of "buy" events used alone.

    So if you are using ALS, check your results before assuming that combining different events will help. Using two models with ALS doesn't solve the problem either because from buy events you are recommending that a person buy something, from view (or secondary dataset) you are recommending a person view something. The fundamental nature of intent is not solved. A linear combination of recs still mixes the intents and may very well lead to decreased quality.

    What Mahout's Spark Item Similarity does is to correlate views with buys--actually it correlates a primary action, one where you are clear about user intent, with other actions or information about the user. It builds a correlation matrix that in effect scrubs the views of the ones that did not correlate to buys. We can then use the data. This is a very powerful idea because now almost any user attribute, or action (virtually the entire clickstream) may be used in making recs since the correlation is always tested. Often there is little correlation but that's ok, it's an optimization to remove from the calculation since the correlation matrix will add very little to the recs.

    BTW if you find integration of Mahout's Spark Item Similarity daunting compared to using MLlib ALS, I'm about to donate an end-to-end implementation as a template for Prediction.io, all of which is Apache licensed open source.