[SOLVED] How to use both string and float DataType in sklearn KNN .fit() method

I have a dataset that contains both string and float columns, and I want to train a KNN model on it, but calling .fit() raises a ValueError saying:

could not convert string to float

inputs=data.drop(['HeartDisease'],axis='columns')
output=data.drop(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope'],axis='columns')

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(inputs,output,train_size=0.8)

from sklearn.neighbors import KNeighborsClassifier
model=KNeighborsClassifier(n_neighbors=31)
model.fit(x_train,y_train)

I expected the model to train on this dataset.

Solution

Most ML models, including sklearn's KNeighborsClassifier, can't use string data as is. You have to preprocess your input to convert it to a numeric type. Outside of natural language processing, a text column usually holds only a small number of distinct values (a categorical feature).

For example, the 'ChestPainType' column should have only 4 values: ['ATA', 'NAP', 'ASY', 'TA']. You have to convert these strings to numbers, for instance 'ATA': 0, 'NAP': 1, 'ASY': 2, 'TA': 3. In Pandas, you can use pd.factorize or pd.get_dummies to do that, but if you use sklearn, try LabelEncoder (intended for the y target, when needed), OneHotEncoder, or sometimes OrdinalEncoder.
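To make the difference between these encoders concrete, here is a minimal sketch on a toy column. The five-row DataFrame and the variable names are assumptions made up for illustration, not part of the heart dataset. Note that pd.factorize numbers the values in order of first appearance, while the sklearn encoders sort the categories alphabetically, so the integer assigned to each label can differ between the two.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column (hypothetical values, not the real dataset)
df = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ASY", "TA", "ATA"]})

# pd.factorize: integer codes in order of first appearance
codes, uniques = pd.factorize(df["ChestPainType"])

# pd.get_dummies: one indicator column per category
dummies = pd.get_dummies(df["ChestPainType"])

# OrdinalEncoder: one integer per category, categories sorted alphabetically
# (expects a 2D input, hence the double brackets)
ordinal = OrdinalEncoder().fit_transform(df[["ChestPainType"]])

# OneHotEncoder: one indicator column per category; the result is a sparse
# matrix by default, so convert it to a dense array to inspect it
onehot = OneHotEncoder().fit_transform(df[["ChestPainType"]]).toarray()

print(codes)            # integer codes from pd.factorize
print(ordinal.ravel())  # integer codes from OrdinalEncoder
print(onehot)           # one row per sample, exactly one 1 per row
```

For a distance-based model such as KNN, one-hot encoding is usually the safer default for unordered categories, because integer codes imply an order and a spacing between labels that may not exist.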

The easiest way is to use a ColumnTransformer.

Reproducible example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
data = pd.read_csv('heart.csv')

features = data.drop(columns=['HeartDisease'])
target = data['HeartDisease']

# Text features to one-hot encode, e.g. 'F': [1, 0], 'M': [0, 1]
feat_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

ct = ColumnTransformer(
    transformers=[('ohe', OneHotEncoder(), feat_cols)],
    remainder='passthrough'
)

# Convert your data to numeric values
X = ct.fit_transform(features)
y = target.to_numpy()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# Step omitted here: use `StandardScaler` to standardize the numeric values,
# since KNN is distance-based and sensitive to feature scale

# Train your model
model = KNeighborsClassifier(n_neighbors=31)
model.fit(X_train, y_train)

# Evaluate your model: mean accuracy on the test set (63% here; the split is
# random, so the value varies between runs)
model.score(X_test, y_test)
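The scaling step that the example skips could look roughly like the sketch below. It is not the answer's original code: the toy DataFrame stands in for heart.csv with randomly generated values, and the names toy, text_cols, num_cols, preprocess and clf are assumptions introduced for illustration. Wrapping the ColumnTransformer and the classifier in one pipeline means the encoder and the scaler are fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for heart.csv (made-up values, illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(30, 80, size=n),
    "Sex": rng.choice(["M", "F"], size=n),
    "MaxHR": rng.integers(80, 200, size=n),
    "HeartDisease": rng.integers(0, 2, size=n),
})

X = toy.drop(columns=["HeartDisease"])
y = toy["HeartDisease"]

text_cols = ["Sex"]          # categorical columns to one-hot encode
num_cols = ["Age", "MaxHR"]  # numeric columns to standardize

preprocess = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(handle_unknown="ignore"), text_cols),
        ("scale", StandardScaler(), num_cols),
    ]
)

# Preprocessing and model in one estimator: fit() learns the categories and
# the scaling statistics from the training data only
clf = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=31))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)
clf.fit(X_train, y_train)

pred = clf.predict(X_test.head())
score = clf.score(X_test, y_test)  # mean accuracy; not meaningful on random labels
print(pred, score)
```

With the real dataset, the same pattern should apply by listing the five text columns in text_cols and the remaining feature columns in num_cols.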