python pandas scikit-learn data-mining standardization

Standardize only numerical features with StandardScaler


I have the following dataset:

df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/HR_comma_sep.csv')

I first encoded salary with a label encoder (le_salary) and then with an ordinal encoder (oe_salary). I then encoded department with a OneHotEncoder (ohe_department). I concatenated everything into concat_df. Now I want to fit a logistic regression with standardisation, and that is where I run into a problem. Here are my features, target, and train/test split:

X=concat_df[[ 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'oe_salary', 'eval_spent']].values
y=concat_df["left"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

I then tried to standardize only the numerical features with the following code:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#select cols to standardize
Cols = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'eval_spent']
#set up preprocessor
preprocessor = ColumnTransformer([('standard', scaler, Cols)], remainder = 'passthrough')
#fit preprocessor
X_train_std = preprocessor.fit_transform(X_train)
X_test_std = preprocessor.transform(X_test)

However, I get the following error, which I don't understand, since I have standardized this data before without any problems.

AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    408         try:
--> 409             all_columns = X.columns
    410         except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
3 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    410         except AttributeError:
    411             raise ValueError(
--> 412                 "Specifying the columns using strings is only "
    413                 "supported for pandas DataFrames"
    414             )

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Why do I get this error and what does it mean?


Solution

  • Remove the .values call so that X and y stay pandas objects:

    X=concat_df[[ 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'oe_salary', 'eval_spent']]
    y=concat_df["left"]
    

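    X is then still a DataFrame, so ColumnTransformer can select the columns by name. Calling .values converts the DataFrame to a NumPy array, which has no column names, and that is what the error message is saying.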

    Furthermore, to get rid of the warnings about the column names, we can rename the columns at the start. Note that the assignment is positional, so this assumes the columns of concat_df are in exactly this order:

    concat_df.columns = ['satisfaction_level',
        'last_evaluation',
        'number_project',
        'average_monthly_hours',
        'time_spent_company',
        'work_accident',
        'promotion_last_5years',
        'IT',
        'RandD',
        'accounting',
        'hr',
        'management',
        'marketing',
        'product_mng',
        'sales',
        'support',
        'technical',
        'oe_salary',
        'eval_spent',
        'left']
    

    We can then select the columns by their new names:

    X=concat_df[['satisfaction_level',
        'last_evaluation',
        'number_project',
        'average_monthly_hours',
        'time_spent_company',
        'work_accident',
        'promotion_last_5years',
        'IT',
        'RandD',
        'accounting',
        'hr',
        'management',
        'marketing',
        'product_mng',
        'sales',
        'support',
        'technical',
        'oe_salary',
        'eval_spent']]
    y=concat_df["left"]
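
To make the difference concrete, here is a minimal sketch on a small made-up frame (toy_X, raw_labels and the values in it are hypothetical stand-ins, not the real HR data). It flattens tuple labels such as ('IT',) into plain strings, scales two numerical columns by name while the input is a DataFrame, shows that the same call on the .values array raises a ValueError, and finally shows integer positions as a possible alternative if you do need to work with arrays.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Mixed labels like those in the question: plain strings plus 1-tuples
# left over from the one-hot encoding step (hypothetical subset).
raw_labels = ["satisfaction_level", "number_project", ("IT",), ("sales",)]
flat_labels = [c[0] if isinstance(c, tuple) else c for c in raw_labels]

# Toy stand-in for the feature frame; the values are invented.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    {
        "satisfaction_level": rng.random(6),
        "number_project": rng.integers(2, 7, size=6).astype(float),
        "IT": [1, 0, 0, 1, 0, 0],
        "sales": [0, 1, 1, 0, 0, 1],
    }
)[flat_labels]

num_cols = ["satisfaction_level", "number_project"]

# DataFrame input: the string column names can be resolved.
preprocessor = ColumnTransformer(
    [("standard", StandardScaler(), num_cols)], remainder="passthrough"
)
X_std = preprocessor.fit_transform(toy_X)

# ndarray input (what .values returns): the same names cannot be resolved.
try:
    ColumnTransformer(
        [("standard", StandardScaler(), num_cols)], remainder="passthrough"
    ).fit_transform(toy_X.values)
    ndarray_error = None
except ValueError as exc:
    ndarray_error = str(exc)

# Alternative for ndarray input: refer to the columns by integer position.
num_idx = [toy_X.columns.get_loc(c) for c in num_cols]
X_std_from_array = ColumnTransformer(
    [("standard", StandardScaler(), num_idx)], remainder="passthrough"
).fit_transform(toy_X.values)

print(X_std.shape)
print(ndarray_error)
```

Keep in mind that ColumnTransformer puts the transformed columns first and the passthrough columns after them, so the output column order can differ from the input order when the scaled columns are not the leading ones.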