python scikit-learn pipeline label-encoding

Pipeline for ML model using LabelEncoding in a Transformer


I'm attempting to incorporate various transformations into a scikit-learn pipeline along with a LightGBM model. This model aims to predict the prices of second-hand vehicles. Once trained, I plan to integrate this model into an HTML page for practical use.

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib

print(numeric_features)
# ['car_year', 'km', 'horse_power', 'cyl_capacity']
print(categorical_features)
# ['make', 'model', 'trimlevel', 'fueltype', 'transmission', 'bodytype', 'color']

# Define transformers for numeric and categorical features
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('labelencoder', LabelEncoder())])

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Append the LightGBM model to the preprocessing pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', best_lgb_model)
])

# Fit the pipeline to training data
pipeline.fit(X_train, y_train)

The error I get when fitting is:

LabelEncoder.fit_transform() takes 2 positional arguments but 3 were given


Solution

  • Replace LabelEncoder with OneHotEncoder: As pointed out in the comments, LabelEncoder is meant only for encoding the target variable (y), not the features (X); the error you see exists precisely to prevent that misuse. For preprocessing categorical features, OneHotEncoder is the appropriate choice, especially within a pipeline where several categorical columns have to be handled at once.
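
    A minimal sketch of that swap, reusing the names from your code (numeric_features, categorical_features, best_lgb_model); handle_unknown='ignore' is added so prediction does not fail on categories unseen during training:

    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer

    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
    categorical_transformer = Pipeline(steps=[
        # One column per category; unseen categories encode to all zeros at predict time
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', best_lgb_model)
    ])
    pipeline.fit(X_train, y_train)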

    Handling High Cardinality Features with LightGBM: If you are concerned about the cost of OneHotEncoder due to the high cardinality of your categorical variables (as you mentioned, up to 100 distinct values in some columns), there is an alternative: LightGBM can handle categorical features natively, without explicit one-hot encoding. Remove the encoding step for the categorical variables, pass the raw categorical data to LightGBM, and identify the categorical columns via the categorical_feature parameter. If the categorical columns of your DataFrame already have the pandas category dtype, categorical_feature='auto' picks them up automatically, so you don't need to list them; the ColumnTransformer then only has to scale the numeric columns and pass the categorical ones through unchanged:

    import numpy as np
    import lightgbm as lgb
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer, make_column_selector

    # Convert all non-numeric columns in the DataFrame to the pandas 'category' dtype
    cat_cols = X_train.select_dtypes(exclude=[np.number]).columns
    X_train[cat_cols] = X_train[cat_cols].astype('category')

    # Define the numeric transformer
    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

    # Scale the numeric columns (selected automatically with make_column_selector) and
    # pass the categorical columns through untouched; pandas output keeps their
    # 'category' dtype so LightGBM can detect them (set_output requires scikit-learn >= 1.2)
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, make_column_selector(dtype_include=np.number))
        ],
        remainder='passthrough',
        verbose_feature_names_out=False
    ).set_output(transform='pandas')

    # Create the LightGBM model with automatic categorical feature handling
    best_lgb_model = lgb.LGBMRegressor(categorical_feature='auto')

    # Create the full pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', best_lgb_model)
    ])

    # Fit the model
    pipeline.fit(X_train, y_train)
    

    Note that in your specific case you don't need a pipeline at all. You can leave out the scaling step entirely, since tree-based models like LightGBM do not require scaled features: they do not rely on distance calculations or gradients where scale matters. A pipeline-free variant is sketched below.
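
    For completeness, a minimal sketch of that pipeline-free variant, assuming X_train is the same pandas DataFrame as above:

    import numpy as np
    import lightgbm as lgb

    # Convert non-numeric columns to 'category' so LightGBM handles them natively
    cat_cols = X_train.select_dtypes(exclude=[np.number]).columns
    X_train[cat_cols] = X_train[cat_cols].astype('category')

    # No scaling, no ColumnTransformer: fit the regressor directly on the raw DataFrame
    model = lgb.LGBMRegressor()
    model.fit(X_train, y_train)

    # Predict on new data prepared the same way (same columns, same 'category' dtypes)
    # y_pred = model.predict(X_new)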