numpy scipy sparse-matrix amazon-sagemaker matrix-factorization

How to turn a matrix into a sparse matrix and protobuf it


I have a data set with 16 columns and 100,000 rows that I'm trying to prepare for matrix-factorization training. I'm using the following code to split it and turn it into a sparse matrix:

X=data.drop([data.columns[0]],axis='columns')
y=data[[1]]
X=lil_matrix(100000,15).astype('float32')
y=np.array(y).astype('float32')
X

But when I run it, I get this error:

<1x1 sparse matrix of type '<class 'numpy.float32'>' with 1 stored elements in LInked List format>

When I try to plug it into a training/testing split it gives me further errors:

Found input variables with inconsistent numbers of samples: [1, 100000]


Solution

  • Your linked notebook is creating a 'blank' sparse matrix, and setting selected elements from data it reads from a csv.

    A simple example of this:

    In [565]: from scipy import sparse                                                                           
    In [566]: M = sparse.lil_matrix((10,5), dtype=float)                                                         
    In [567]: M                                                                                                  
    Out[567]: 
    <10x5 sparse matrix of type '<class 'numpy.float64'>'
        with 0 stored elements in LInked List format>
    

    Note that I use (10,5) to specify the matrix shape. The () matter! That's why I stressed reading the docs. In the link the relevant line is:

    X = lil_matrix((lines, columns)).astype('float32')
    

    Now I can set a couple of elements, just as I would with a dense array:

    In [568]: M[1,2] = 12.3                                                                                      
    In [569]: M[3,1] = 1.1                                                                                       
    In [570]: M                                                                                                  
    Out[570]: 
    <10x5 sparse matrix of type '<class 'numpy.float64'>'
        with 2 stored elements in LInked List format>
    

    I can use toarray to display the matrix as a dense array (don't try this with large dimensions).

    In [571]: M.toarray()                                                                                        
    Out[571]: 
    array([[ 0. ,  0. ,  0. ,  0. ,  0. ],
           [ 0. ,  0. , 12.3,  0. ,  0. ],
           [ 0. ,  0. ,  0. ,  0. ,  0. ],
           [ 0. ,  1.1,  0. ,  0. ,  0. ],
           [ 0. ,  0. ,  0. ,  0. ,  0. ],
           [ 0. ,  0. ,  0. ,  0. ,  0. ],
           [ 0. ,  0. ,  0. ,  0. ,  0. ],
           [ 0. ,  0. ,  0. ,  0. ,  0. ],
           [ 0. ,  0. ,  0. ,  0. ,  0. ],
           [ 0. ,  0. ,  0. ,  0. ,  0. ]])
    

    If I omit the (), it makes a (1,1) matrix with just one element, the first number.

    In [572]: sparse.lil_matrix(10,5)                                                                            
    Out[572]: 
    <1x1 sparse matrix of type '<class 'numpy.int64'>'
        with 1 stored elements in LInked List format>
    In [573]: _.A                                                                                                
    Out[573]: array([[10]], dtype=int64)
    

    Look again at your code. You assign X twice. The first time it is a dataframe; the second time it is this bad lil initialization, which makes no use of the first X, so the data never reaches the sparse matrix.

    X=data.drop([data.columns[0]],axis='columns')
    ...
    X=lil_matrix(100000,15).astype('float32')
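
One way to carry the dataframe values into the sparse matrix is to pass the dense values themselves to lil_matrix, so the shape is inferred from the data instead of being typed by hand. This is only a minimal sketch: the small random `data` frame is a hypothetical stand-in for your 100,000 x 16 data set, and I am assuming the first column is the target and the remaining columns are the features.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the real data set (8 rows, 4 columns).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 3, size=(8, 4)))

# Assumption: the first column is the target, the rest are features.
X_df = data.drop(columns=[data.columns[0]])
y = data[data.columns[0]].to_numpy(dtype="float32")

# Build the sparse matrix from the values; no shape argument is needed.
X = sparse.lil_matrix(X_df.to_numpy(dtype="float32"))

# Both now have the same number of rows, which is what a
# train/test split checks before splitting.
print(X.shape, y.shape)
```

If X and y report the same number of rows here, the "inconsistent numbers of samples" complaint from the split should no longer apply, since that error comes from X having 1 row while y has 100000.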