Tags: mahout-recommender, recommendation-engine

Data collection in recommendation systems


What is the correct way to collect user data in a recommendation system so that it is available both for model construction (offline) and for prediction (online)? Assume that:

  1. Prediction is done through multiple servers. Servers have some available memory but are considered stateless from a user data perspective. This means that users may interact with different machines during a session and user data should be available regardless of which machine the user has landed on.
  2. All metadata attached to articles and recommended items (such as classification, article text, etc.) is available both online and offline; however, fetching this data requires a DB call.
  3. Some user activity needs to be available for inference fairly quickly, while other activity may become available a few hours after it happened. For instance, after a user clicks on a recommendation, we would like that information to be available as soon as possible. On the other hand, it's OK for longer-term browsing-behavior data to become available for inference hours after the user browsed that content.
  4. Data for all users is too large to hold in memory while training.

Question: The system should work at massive scale. What are the main technologies and data structures one would use?


Solution

  • I'd suggest using a system like Apache PredictionIO with the Universal Recommender engine template.

    PIO ingests user-behavior events in real time and tells the recommender to train in the background, which produces a "model" of user behavior in general. An individual user's behavior can come from real-time observations, and no input data is treated differently from any other, so all of it can affect recommendations in real time. In other words, the model is created in the background in batch mode, while real-time user behavior is used to formulate queries, so fresh data still affects the results. This is generally called Lambda-style machine learning.
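    To make the input side concrete, here is a minimal sketch of posting one behavior event to the PIO EventServer's REST endpoint (/events.json). The host, port, and access key are placeholders; in a real deployment they come from your own PIO app configuration:

    ```python
    import datetime
    import requests

    # Placeholders: use your own EventServer host/port and app access key.
    EVENT_SERVER = "http://localhost:7070"
    ACCESS_KEY = "YOUR_APP_ACCESS_KEY"

    def send_event(user_id, event_name, item_id):
        """Post one user-behavior event to the PredictionIO EventServer."""
        event = {
            "event": event_name,               # e.g. "click" on a recommendation
            "entityType": "user",
            "entityId": user_id,
            "targetEntityType": "item",
            "targetEntityId": item_id,
            "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
        }
        resp = requests.post(f"{EVENT_SERVER}/events.json",
                             params={"accessKey": ACCESS_KEY},
                             json=event)
        resp.raise_for_status()
        return resp.json()   # contains the id of the stored event

    # The click is available to recommendation queries immediately,
    # without waiting for the next background training run.
    send_event("user-123", "click", "article-456")
    ```

    Note how this addresses the fast/slow split in the question: every event is usable in queries as soon as it is stored, while the heavier model rebuild happens later in batch.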

    Most recommenders only let you use a single event type, a conversion, as evidence of user behavior and therefore as a possible indicator of user preference. The Universal Recommender is the exception: it can use any number of different user behaviors.
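    To illustrate "any number of user behaviors", here is a sketch using the hypothetical send_event helper from above to record several event types for one user. The event names themselves are arbitrary strings of your choosing:

    ```python
    # Event names are just strings you define; the UR's engine.json lists
    # which of them count as indicators (conventionally the first listed
    # is the primary "conversion" event the model predicts).
    send_event("user-123", "purchase", "article-456")  # primary conversion
    send_event("user-123", "click",    "article-789")  # secondary indicator
    send_event("user-123", "view",     "article-321")  # secondary indicator
    ```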

    PIO and the UR are built on highly scalable services (HDFS, HBase, Spark, and Elasticsearch), so production systems can be scaled horizontally to whatever size is needed. The Universal Recommender uses the Correlated Cross-Occurrence (CCO) algorithm from modern Apache Mahout, not the older Hadoop-based Mahout recommenders.
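    For intuition about what CCO computes: for each candidate pair (conversion on item A, some other behavior on item B), it asks whether the two co-occur across users more often than chance would predict, using Dunning's log-likelihood ratio test. Below is a small Python port of that test as it appears in Mahout's LogLikelihood class; the user counts at the bottom are made up purely for illustration:

    ```python
    import math

    def x_log_x(x):
        return 0.0 if x == 0 else x * math.log(x)

    def entropy(*counts):
        """Unnormalized Shannon entropy of a list of counts."""
        return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

    def llr(k11, k12, k21, k22):
        """Dunning's log-likelihood ratio for a 2x2 contingency table.

        k11: users who did both behaviors (bought A and viewed B)
        k12: users who bought A but did not view B
        k21: users who viewed B but did not buy A
        k22: users who did neither
        """
        row_entropy = entropy(k11 + k12, k21 + k22)
        col_entropy = entropy(k11 + k21, k12 + k22)
        mat_entropy = entropy(k11, k12, k21, k22)
        if row_entropy + col_entropy < mat_entropy:  # guard against round-off
            return 0.0
        return 2.0 * (row_entropy + col_entropy - mat_entropy)

    # Made-up counts: of 10,000 users, 100 bought A, 200 viewed B,
    # and 50 did both -- far more overlap than chance predicts.
    print(llr(50, 50, 150, 9750))  # large score => "viewed B" predicts "buys A"
    ```

    Roughly speaking, item pairs that score high enough become indicator data that the UR indexes in Elasticsearch, and the user's recent real-time behavior is turned into a query against that index.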