python-3.x, tensorflow, neural-network, embedding, collaborative-filtering

Why, before embedding, do the IDs have to be sequential starting at zero?


I am learning collaborative filtering from this blog post, Deep Learning With Keras: Recommender Systems.

The tutorial is good, and the code works well. Here is my code.

There is one thing that confuses me. The author said:

The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential starting at zero to use for modeling (you'll see why later).

user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()

But he didn't seem to mention the reason, and I don't understand why this is needed. Can someone explain it to me?


Solution

  • Embedding inputs are assumed to be sequential integers starting at zero.

    The first argument of Embedding is the input dimension (input_dim), i.e. the size of the vocabulary. Embedding assumes that the maximum value in the input is input_dim - 1 (indices start at 0), so every input value must be less than input_dim.

    https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding?hl=ja

    As an example, in the following code only the input [4, 3] maps to valid embeddings; the row [7, 8] is out of range because the input dimension is 5, so those indices have no row in the embedding table (depending on the backend and device this raises an error or yields meaningless values).

    I think it is clearer to show this with TensorFlow:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding
    import numpy as np

    model = Sequential()
    model.add(Embedding(5, 1, input_length=2))
    input_array = np.array([[4, 3], [7, 8]])  # [7, 8] is out of range for input_dim=5
    model.compile('rmsprop', 'mse')
    output_array = model.predict(input_array)


    You can increase the input dimension to 9, and then you will get embeddings for both inputs.
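    To see why the input dimension matters, here is a minimal numpy sketch that treats an embedding layer as what it fundamentally is: a lookup table indexed by the integer inputs (the random table stands in for learned weights; this is an illustration, not Keras internals):

    ```python
    import numpy as np

    # An embedding layer is essentially a lookup table: a weight matrix of
    # shape (input_dim, output_dim), indexed by the integer inputs.
    rng = np.random.default_rng(0)
    input_dim, output_dim = 9, 1           # input_dim raised to 9 so index 8 is valid
    table = rng.normal(size=(input_dim, output_dim))

    inputs = np.array([[4, 3], [7, 8]])    # every index is now < input_dim
    vectors = table[inputs]                # shape (2, 2, 1)
    ```

    With input_dim=5 the same lookup would fail, since rows 7 and 8 do not exist in the table. This is exactly why every ID must lie in the range 0 .. input_dim - 1.
    
    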

    You could increase the input dimension to the maximum ID + 1 in the original data set, but this is not efficient: the embedding matrix would carry a row for every unused ID in between. It is similar to one-hot encoding, where re-encoding the IDs to sequential integers saves a great amount of memory.
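    The memory argument can be made concrete with a small sketch. The `user_ids` values below are made up for illustration, and np.unique(..., return_inverse=True) is used as a stand-in for sklearn's LabelEncoder (it produces the same kind of 0-based sequential codes):

    ```python
    import numpy as np

    # Hypothetical raw user IDs: non-sequential, with a large maximum.
    user_ids = np.array([17, 42, 17, 100000, 42])

    # Without re-encoding, the embedding table must cover every index
    # up to the maximum ID, even though most rows are never used.
    naive_input_dim = int(user_ids.max()) + 1     # 100001 rows

    # np.unique with return_inverse=True maps each ID to a sequential
    # code in 0 .. n_users - 1, just like LabelEncoder.fit_transform.
    unique_ids, codes = np.unique(user_ids, return_inverse=True)
    compact_input_dim = len(unique_ids)           # 3 rows

    print(codes)                                  # [0 1 0 2 1]
    print(naive_input_dim, compact_input_dim)     # 100001 3
    ```

    With only 3 distinct users, the re-encoded table needs 3 rows instead of 100001, which is why the tutorial runs the IDs through LabelEncoder first.
    
    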