Tags: python, keras, scikit-learn, sklearn-pandas

sparse matrix length is ambiguous


I'm very new to machine learning, so this question might sound stupid. I'm following a tutorial on text classification, but I've hit an error that I have no idea how to solve.

This is the code I have (it is basically what is in the tutorial):

import pandas as pd

filepath_dict = {'yelp':   'data/yelp_labelled.txt',
              'amazon': 'data/amazon_cells_labelled.txt',
              'imdb':   'data/imdb_labelled.txt'}

df_list = []
for source, filepath in filepath_dict.items():
    df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
    df['source'] = source
    df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0:4])


from sklearn.feature_extraction.text import CountVectorizer

df_yelp = df[df['source'] == 'yelp']

sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.25, random_state=1000)


from sklearn.feature_extraction.text import CountVectorizer


vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)

from keras.models import Sequential
from keras import layers

input_dim = X_train.shape[1] 

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', 
            optimizer='adam', 
            metrics=['accuracy'])
model.summary()

history = model.fit(X_train, y_train,
                    nb_epoch=100,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)

When I reach the last line, I get an error:

"TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]"

I guess I'll have to perform some kind of transformation on the data I'm using, or load the data in a different way. I have already searched SO but, being new to all this, I couldn't find anything helpful.

How do I make this work? Ideally I'd like not only the solution but also a brief explanation of why the error happened and what the solution does to fix it.

thanks!


Solution

  • You get this error because your X_train and X_test are of type scipy.sparse.csr.csr_matrix (the sparse format that CountVectorizer.transform returns), whereas your model expects a NumPy array.

    Try converting them to dense and you're good to go:

    X_train = X_train.todense()
    X_test = X_test.todense()
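
To see where the message comes from, here is a minimal sketch that reproduces it without Keras. It assumes the error is raised when len() is called on the sparse input (the message text is SciPy's own); the tiny matrix is made-up data standing in for the CountVectorizer output. It also uses toarray() rather than todense(): toarray() returns a plain numpy.ndarray, while todense() returns a numpy.matrix, so toarray() is arguably the closer match for "a NumPy array".

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the CountVectorizer output (values are made up for illustration).
X_sparse = sparse.csr_matrix(np.array([[1, 0, 2],
                                       [0, 0, 3]]))

# len() is undefined for SciPy sparse matrices; this is the call that fails.
try:
    len(X_sparse)
    len_error = None
except TypeError as exc:
    len_error = str(exc)

# .shape does work on sparse input, which is why
# input_dim = X_train.shape[1] ran without complaint in the question.
n_rows, n_cols = X_sparse.shape

# toarray() gives a regular ndarray, on which len() is well defined.
X_dense = X_sparse.toarray()

print(len_error)
print(type(X_dense).__name__, X_dense.shape, len(X_dense))
```

Keep in mind that a dense array stores every zero explicitly, so for a large vocabulary the converted matrix can take far more memory than the sparse one. For a dataset as small as the Yelp subset used here, that should not be a problem.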