I have a data set with 16 columns and 100,000 rows that I'm trying to prepare for matrix-factorization training. I'm using the following code to split it and turn it into a sparse matrix.
X=data.drop([data.columns[0]],axis='columns')
y=data[[1]]
X=lil_matrix(100000,15).astype('float32')
y=np.array(y).astype('float32')
X
But when I run it, I get this error:
<1x1 sparse matrix of type '<class 'numpy.float32'>' with 1 stored elements in LInked List format>
When I try to plug it into a training/testing split it gives me further errors:
Found input variables with inconsistent numbers of samples: [1, 100000]
Your linked notebook is creating a 'blank' sparse matrix and then setting selected elements from data it reads from a csv.
A simple example of this:
In [565]: from scipy import sparse
In [566]: M = sparse.lil_matrix((10,5), dtype=float)
In [567]: M
Out[567]:
<10x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in LInked List format>
Note that I use (10,5) to specify the matrix shape. The () matter: the shape has to be passed as a single tuple. That's why I stressed reading the docs. In the link, the relevant line is:
X = lil_matrix((lines, columns)).astype('float32')
Now I can set a couple of elements, just as I would with a dense array:
In [568]: M[1,2] = 12.3
In [569]: M[3,1] = 1.1
In [570]: M
Out[570]:
<10x5 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in LInked List format>
I can use toarray to display the matrix as a dense array (don't try this with large dimensions).
In [571]: M.toarray()
Out[571]:
array([[ 0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. , 12.3,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  1.1,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ]])
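The same "create it empty, then fill it" pattern scales to values read from a file. Here is a minimal sketch of that idea; the `records` list is a hypothetical stand-in for (row, column, value) triples parsed from a csv, not the notebook's actual data or loop.

```python
import numpy as np
from scipy import sparse

# Hypothetical (row, column, value) records, standing in for rows parsed from a csv.
records = [(1, 2, 12.3), (3, 1, 1.1), (9, 4, 7.0)]

n_rows, n_cols = 10, 5
# Empty matrix with the shape given as one tuple.
M = sparse.lil_matrix((n_rows, n_cols), dtype=np.float32)

for r, c, v in records:
    M[r, c] = v  # the lil format is meant for this kind of incremental assignment

# Once filling is done, convert to csr, which is better suited to arithmetic.
M_csr = M.tocsr()
print(M.shape, M.nnz)
```

Only the assigned entries are stored; everything else stays an implicit zero.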
If I omit the (), it makes a (1,1) matrix with just one element, the first number.
In [572]: sparse.lil_matrix(10,5)
Out[572]:
<1x1 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in LInked List format>
In [573]: _.A
Out[573]: array([[10]], dtype=int64)
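For the dimensions in the question, the corrected call would look something like the sketch below. It passes the shape as one tuple, and it sets the dtype in the constructor rather than with a separate astype call (that part is my choice, not something the notebook does).

```python
import numpy as np
from scipy import sparse

# Shape goes in as ONE tuple argument; dtype can be set in the same call.
X = sparse.lil_matrix((100000, 15), dtype=np.float32)

print(X.shape)  # (100000, 15)
print(X.nnz)    # 0 stored elements: the matrix is empty until values are assigned
```

Without the tuple, the first positional argument is taken as the data to convert, which is why a bare 10 turns into a 1x1 matrix holding 10.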
Look again at your code. You set the X value twice. The first time it is a dataframe; the second time it is this bad lil initialization, which does not make use of the first X.
X=data.drop([data.columns[0]],axis='columns')
...
X=lil_matrix(100000,15).astype('float32')
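If the goal is a sparse matrix that actually holds the dataframe's values, one option is to hand those values to the constructor instead of a shape. This is a sketch under assumptions: the small random `data` frame is a hypothetical stand-in for your 100,000 x 16 data, and I am assuming the first column is the target (your code drops column 0 from X but selects column 1 for y, so adjust this to your layout).

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Small stand-in for the 100,000 x 16 frame in the question (hypothetical data).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 3, size=(8, 4)))

# Assumption: the first column is the target and the rest are features.
features = data.drop(columns=[data.columns[0]])
target = data[data.columns[0]]

# Pass the values themselves to the constructor, so the matrix holds the data.
X = sparse.lil_matrix(features.to_numpy(dtype=np.float32))
y = target.to_numpy(dtype=np.float32)

print(X.shape, y.shape)  # the row counts now agree
```

With X and y built from the same frame, both have one row per sample, which is what a train/test split checks for.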