What is the correct way to collect user data so that it is available both for model construction (offline) and for prediction (online) in a recommendation system, assuming the system must work at massive scale?
What are the main technologies and the main data structures one would use?
I'd suggest using a system like Apache PredictionIO with the Universal Recommender engine template.
PIO ingests user behavior in real time and tells the Recommender to train in the background, producing a "model" of user behavior in general. The model is built in batch mode, but real-time user behavior is used to formulate queries, so there is no difference in how any data is treated as input: fresh events affect recommendations immediately. This combination of background batch training with real-time queries is generally called Lambda-style machine learning.
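For concreteness, here is a minimal sketch of both halves in Python, assuming a PIO Event Server on its default port 7070, a deployed engine on its default port 8000, and a placeholder access key (the real key is issued per app by PIO):

```python
import datetime

import requests

EVENT_SERVER = "http://localhost:7070"   # PIO Event Server (default port)
ENGINE_SERVER = "http://localhost:8000"  # deployed engine (default port)
ACCESS_KEY = "YOUR_APP_ACCESS_KEY"       # placeholder; issued per app by PIO


def send_event(event, user_id, item_id):
    """Send one user-behavior event to the Event Server in real time."""
    payload = {
        "event": event,
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        # ISO 8601 timestamp; PIO defaults to the current time if omitted
        "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    resp = requests.post(
        f"{EVENT_SERVER}/events.json",
        params={"accessKey": ACCESS_KEY},
        json=payload,
    )
    resp.raise_for_status()


def recommend(user_id, num=4):
    """Query the deployed engine: the model was built in batch, but the
    query reflects the user's behavior up to this moment."""
    resp = requests.post(
        f"{ENGINE_SERVER}/queries.json",
        json={"user": user_id, "num": num},
    )
    resp.raise_for_status()
    return resp.json()


send_event("buy", "u-123", "i-456")  # ingested immediately
print(recommend("u-123"))            # served against the batch-built model
```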
Most types of recommenders only let you use a single conversion event as evidence of user behavior, and therefore as a possible indicator of user preference. The Universal Recommender is the exception: it can use any number of different user behaviors.
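Reusing `send_event()` from the sketch above, feeding the UR multiple kinds of behavior is just a matter of sending differently named events. The event names below are examples; in the UR the usable events are declared in the engine's configuration (the `eventNames` list in engine.json), with one conversion event acting as the primary indicator:

```python
# Event names here are illustrative. The UR is configured with the list
# of events it should learn from, where the primary event is the
# conversion being predicted and the rest are secondary indicators.
send_event("buy", "u-123", "i-456")          # primary conversion event
send_event("view", "u-123", "i-789")         # secondary behavior
send_event("add-to-cart", "u-123", "i-789")  # secondary behavior
```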
PIO and the UR are built on highly scalable services, so production systems can be scaled horizontally as far as needed. These services include HDFS, HBase, Spark, and Elasticsearch. The Universal Recommender uses the Correlated Cross-Occurrence (CCO) algorithm from modern, Spark-based Apache Mahout, not the older Hadoop MapReduce-based Mahout recommenders.
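To give intuition for what CCO computes: Mahout builds co-occurrence and cross-occurrence matrices from the different behaviors and keeps only the item pairs whose counts are anomalously correlated under a log-likelihood ratio (LLR) test. Below is a toy, single-machine re-implementation of that 2x2 LLR test for illustration (the real computation runs distributed on Spark), with made-up counts:

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = users who did both behaviors (e.g. bought A and viewed B)
    k12 = users who did the first but not the second
    k21 = users who did the second but not the first
    k22 = users who did neither
    Higher scores mean the correlation is less likely to be chance."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)


# Hypothetical counts: of 10,000 users, 100 bought item A, 1,000 viewed
# item B, and 80 did both; far more overlap than chance would predict,
# so this cross-occurrence pair gets a high score.
print(llr(80, 20, 920, 8980))
```

Pairs that pass the test become the indicators the UR indexes in Elasticsearch and matches against a user's recent behavior at query time.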