I have a data set containing the ratings that users (identified by user ID) gave to products (identified by product ID). There are only 5,000 products and 10,000 users, but the IDs are not consecutive numbers. I would like to convert my DataFrame to a sparse COO matrix, coo_matrix((data, (row, col)), shape), where row and col run over the actual number of users and products rather than over the raw IDs. Is there any way to do that? Below is an illustration:
Data frame:
User ID | Product ID | Rating |
---|---|---|
1 | 14 | 0.1 |
1 | 15 | 0.2 |
2 | 14 | 0.3 |
2 | 16 | 0.3 |
5 | 19 | 0.4 |
The expected result is the following matrix (in sparse COO form):
UserID / ProductID | 14 | 15 | 16 | 19 |
---|---|---|---|---|
1 | 0.1 | 0.2 | 0 | 0 |
2 | 0.3 | 0 | 0.3 | 0 |
5 | 0 | 0 | 0 | 0.4 |
I ask because passing the raw IDs to coo_matrix would normally give a very large matrix, indexed by every value up to the largest ID: (1, 2, ..., 19) for product ID and (1, 2, 3, 4, 5) for user ID.
This is for my thesis.
Hi, I hope this helps, and good luck with your thesis:
import pandas as pd
from scipy.sparse import coo_matrix
# Example data from the question
dataframe = pd.DataFrame(data={'User ID': [1, 1, 2, 2, 5],
                               'Product ID': [14, 15, 14, 16, 19],
                               'Rating': [0.1, 0.2, 0.3, 0.3, 0.4]})
row = dataframe['User ID']
col = dataframe['Product ID']
data = dataframe['Rating']
# Build the COO matrix from the raw IDs, then convert it to a dense array
coo = coo_matrix((data, (row, col))).toarray()
new_dataframe = pd.DataFrame(coo)
# Drop non-existent Product IDs (all-zero columns); optional, delete if not intended
new_dataframe = new_dataframe.loc[:, (new_dataframe != 0).any(axis=0)]
# Drop non-existent User IDs (all-zero rows); optional, delete if not intended
new_dataframe = new_dataframe.loc[(new_dataframe != 0).any(axis=1)]
print(new_dataframe)
Output:
    14   15   16   19
1  0.1  0.2  0.0  0.0
2  0.3  0.0  0.3  0.0
5  0.0  0.0  0.0  0.4
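The approach above converts the matrix to a dense array before trimming it, which may use a lot of memory when the IDs are large. As a possible alternative, here is a minimal sketch that first maps each ID to a consecutive position with `pd.factorize`, so the result can stay in sparse COO form with one row per user and one column per product. The variable names (`ratings`, `row_codes`, `user_ids`, and so on) are illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Example data from the question
ratings = pd.DataFrame({
    "User ID": [1, 1, 2, 2, 5],
    "Product ID": [14, 15, 14, 16, 19],
    "Rating": [0.1, 0.2, 0.3, 0.3, 0.4],
})

# factorize(sort=True) returns integer codes 0..n-1 plus the sorted unique IDs,
# so codes[i] is the position of the i-th ID among the unique values
row_codes, user_ids = pd.factorize(ratings["User ID"], sort=True)
col_codes, product_ids = pd.factorize(ratings["Product ID"], sort=True)

# The shape is (number of distinct users, number of distinct products)
sparse_ratings = coo_matrix(
    (ratings["Rating"].to_numpy(), (row_codes, col_codes)),
    shape=(len(user_ids), len(product_ids)),
)

# Densify only to inspect this small example; keep it sparse for real data
dense_view = pd.DataFrame(
    sparse_ratings.toarray(), index=user_ids, columns=product_ids
)
print(dense_view)
```

Here `user_ids[i]` and `product_ids[j]` give back the original IDs for row `i` and column `j`, so the mapping can be reversed later if needed.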