Given is a simple CSV file:
A,B,C
Hello,Hi,0
Hola,Bueno,1
Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)
train_y = test['C'] == 1
train_x = test[cols]
clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)
But I just get this traceback when invoking fit():
ValueError: could not convert string to float: 'Bueno'
scikit-learn version is 0.16.1.
You have to encode the string columns before calling fit(). As has already been said, fit() does not accept strings, but you can work around this.
There are several classes that can be used:
LabelEncoder: turns each distinct string into an incremental integer value
OneHotEncoder: uses the One-of-K scheme to transform your strings into binary (0/1) integer columns, one column per distinct value
Personally, I posted almost the same question on Stack Overflow some time ago. I wanted a scalable solution, but didn't get any answer. I selected OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have a lot of different strings the matrix grows very quickly and needs a lot of memory.
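To make the two options concrete, here is a minimal sketch on the data from the question. It is only an illustration under a few assumptions: the CSV is inlined through io.StringIO instead of being read from test.csv, only A and B are used as features (C is treated purely as the target), and the one-hot step uses pandas.get_dummies as a stand-in for OneHotEncoder because it accepts string columns directly. The names feature_cols, label_encoders, train_x_label and train_x_onehot are made up for this example.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Same rows as test.csv in the question, inlined so the sketch is self-contained.
csv_text = "A,B,C\nHello,Hi,0\nHola,Bueno,1\n"
test = pd.read_csv(io.StringIO(csv_text), dtype={'A': str, 'B': str, 'C': int})

feature_cols = ['A', 'B']  # assumption: C is only the target, not a feature
train_y = test['C'] == 1

# Option 1: LabelEncoder, one encoder per string column.
# Each distinct string becomes an integer code (classes are sorted first).
label_encoders = {}
train_x_label = pd.DataFrame(index=test.index)
for col in feature_cols:
    label_encoders[col] = LabelEncoder()
    train_x_label[col] = label_encoders[col].fit_transform(test[col])

# Option 2: one-hot (One-of-K) encoding, one 0/1 column per distinct string.
# pandas.get_dummies is used here instead of OneHotEncoder (an assumption
# made for brevity, not what the answer above used).
train_x_onehot = pd.get_dummies(test[feature_cols])

# Either encoded frame can now be passed to fit() without the ValueError.
clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x_onehot, train_y)
```

Keep the fitted encoders (or the list of one-hot column names) around, because new data has to be encoded in exactly the same way before calling predict(). The integer codes from LabelEncoder imply an ordering that the strings do not really have, which is one reason to prefer the one-hot form when the number of distinct strings is small enough to fit in memory.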