python, pandas, machine-learning, scikit-learn, knn

How to use both string and float data types in sklearn KNN .fit() method


I have a dataset that contains both string and float data types, and I want to train a KNN model on it, but .fit() raises a ValueError:

could not convert string to float

inputs=data.drop(['HeartDisease'],'columns')
output=data.drop(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope'],'columns')

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(inputs,output,train_size=0.8)

from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=31)
model.fit(x_train,y_train)

I was expecting the model to be trained on this dataset.


Solution

  • Scikit-learn models can't use string data as is: you have to preprocess your input and convert the strings to numbers. Outside of natural language processing, a text column usually contains only a small number of distinct values (categorical features).

    For example, the 'ChestPainType' column should have only 4 values: ['ATA', 'NAP', 'ASY', 'TA']. You have to map these strings to numbers, e.g. 'ATA': 0, 'NAP': 1, 'ASY': 2, 'TA': 3. In pandas you can use pd.factorize or pd.get_dummies for that; with scikit-learn, use LabelEncoder (mainly for the y target when needed), OneHotEncoder, or sometimes OrdinalEncoder. A short sketch of both approaches appears after the example below.

    The easiest way to apply an encoder only to the text columns is a ColumnTransformer.

    Reproducible example:

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.compose import ColumnTransformer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix
    
    # https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
    data = pd.read_csv('heart.csv')
    
    features = data.drop(columns=['HeartDisease'])
    target = data['HeartDisease']
    
    # Text features to convert to numbers, e.g. 'F' -> 0, 'M' -> 1 with OrdinalEncoder
    feat_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
    
    ct = ColumnTransformer(
        transformers=[('le', OrdinalEncoder(), feat_cols)],
        remainder='passthrough'
    )
    
    # Convert your data as numeric values
    X = ct.fit_transform(features)
    y = target.values
    
    # Create 2 datasets for train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
    
    # Missing step: use StandardScaler to normalize the numeric values
    # (see the scaling sketch after this example)
    
    # Train your model
    model = KNeighborsClassifier(n_neighbors=31)
    model.fit(X_train, y_train)
    
    # Evaluate your model (about 63% accuracy here)
    print(model.score(X_test, y_test))
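
    Encoding sketch (referenced above): a minimal, made-up example showing both options on a toy column with the same four chest-pain values. The toy DataFrame is purely illustrative; note that OrdinalEncoder assigns the integers in alphabetical order, so the exact mapping differs from the 'ATA': 0 example above.

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # Toy column with the 4 chest-pain categories (illustrative data only)
    toy = pd.DataFrame({'ChestPainType': ['ATA', 'NAP', 'ASY', 'TA', 'ATA']})

    # Option 1: pandas one-hot encoding, one 0/1 column per category
    print(pd.get_dummies(toy['ChestPainType']))

    # Option 2: OrdinalEncoder, one integer per category (alphabetical by default)
    enc = OrdinalEncoder()
    print(enc.fit_transform(toy[['ChestPainType']]))  # ASY -> 0, ATA -> 1, NAP -> 2, TA -> 3
    print(enc.categories_)                            # the mapping actually learned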
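
    The example imports confusion_matrix but never calls it. If you want more detail than the plain accuracy from model.score, here is a short sketch continuing from the trained model above:

    from sklearn.metrics import confusion_matrix

    # Rows are true classes, columns are predicted classes (labels sorted: 0, 1)
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))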
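
    Scaling sketch (the "missing step" flagged in the example): this continues from the variables above, assuming you run it right after the train/test split. KNN is distance-based, so without scaling, features with large ranges such as Cholesterol or MaxHR dominate the distance; the ~63% score above was measured without this step, so the result will change.

    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on the training split only, then apply the same
    # transformation to the test split (avoids leaking test-set statistics).
    # The whole matrix is scaled here for simplicity; strictly, only the
    # numeric columns need it.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Retrain and re-evaluate on the scaled features
    model = KNeighborsClassifier(n_neighbors=31)
    model.fit(X_train_scaled, y_train)
    print(model.score(X_test_scaled, y_test))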