python pandas sampling resampling smote

SMOTE - could not convert string to float


I think I'm missing something in the code below.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


# Split into training and test sets

# Testing Count Vectorizer

X = df[['Spam']]
y = df['Value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)


sm = pd.concat([X_resampled, y_resampled], axis=1)

I'm getting this error:

ValueError: could not convert string to float:
---> 19 X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

An example of the data:

Spam                                             Value
Your microsoft account was compromised             1
Manchester United lost against PSG                 0
I like cooking                                     0

I think I need to transform both the training and test sets to fix the error, but I don't know how to apply the transformation to both. I've tried some examples I found via Google, but none of them fixed the issue.


Solution

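  • Convert the text data to numeric features before applying SMOTE, as shown below. SMOTE creates synthetic samples by interpolating between feature vectors, so it cannot work on raw strings.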

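    from sklearn.feature_extraction.text import CountVectorizer

    # Learn the vocabulary from the training text only
    vectorizer = CountVectorizer()
    vectorizer.fit(X_train.values.ravel())

    # Turn both sets into word-count matrices using that vocabulary
    X_train = vectorizer.transform(X_train.values.ravel())
    X_test = vectorizer.transform(X_test.values.ravel())

    # Convert the sparse matrices to dense arrays
    X_train = X_train.toarray()
    X_test = X_test.toarray()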
    

    and then add your SMOTE code

    X_train = pd.DataFrame(X_train)
    X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)