
Test train split while retaining original dimension


I am trying to split a pandas DataFrame of shape 610x9724 (610 users x 9724 movies) so that 80% of the non-null values go into the training set and the remaining 20% go into the test set. The values moved to the test set should be replaced with null in the training set, and likewise the values kept for training should be null in the test set. Both sets would still be 610x9724, just with more nulls than the original dataset.

I would then use SVD on the training set (610x9724) to predict the removed values, which are the ones held in the test set.

I have tried sklearn's train_test_split, but it splits by row: the training set becomes 549x9724 and the validation set becomes 61x9724, which makes it difficult to compute the RMSE between the predictions and the test set. Is there an easy way to do this split?

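from sklearn import model_selection

# df holds one rating per row, with columns userId, movieId and rating
data = df.pivot_table(index='userId', columns='movieId', values='rating')

# train_test_split samples whole rows (users), so the row count changes
data_train, data_valid = model_selection.train_test_split(
    data, test_size=0.1, random_state=42
)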

print(data.shape) # (610, 9724)
print(data_train.shape) # (549, 9724)
print(data_valid.shape) # (61, 9724)

Solution

  • You can reindex both DataFrames to restore the original dimensions. Every row whose index is missing from a split will be filled with NaN:

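    from sklearn.model_selection import train_test_split

    train, test = train_test_split(data, test_size=0.2, random_state=42)
    
    # Re-add the rows that went to the other split; they come back as all-NaN
    train = train.reindex(data.index)
    test = test.reindex(data.index)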
    

    Output:

    >>> train.shape
    (610, 9724)
    
    >>> test.shape
    (610, 9724)