pythonscikit-learnrandom-forest

RandomForestClassfier.fit(): ValueError: could not convert string to float


Given is a simple CSV file:

A,B,C
Hello,Hi,0
Hola,Bueno,1

Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:

cols = ['A','B','C']
col_types = {'A': str, 'B': str, 'C': int}
test = pd.read_csv('test.csv', dtype=col_types)

train_y = test['C'] == 1
train_x = test[cols]

clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(train_x, train_y)

But I just get this traceback when invoking fit():

ValueError: could not convert string to float: 'Bueno'

scikit-learn version is 0.16.1.


Solution

  • You have to do some encoding before using fit(). As it was told fit() does not accept strings, but you solve this.

    There are several classes that can be used :

    Personally, I have post almost the same question on Stack Overflow some time ago. I wanted to have a scalable solution, but didn't get any answer. I selected OneHotEncoder that binarize all the strings. It is quite effective, but if you have a lot of different strings the matrix will grow very quickly and memory will be required.