I'm attempting to incorporate various transformations into a scikit-learn
pipeline along with a LightGBM
model. This model aims to predict the prices of second-hand vehicles. Once trained, I plan to integrate this model into an HTML page for practical use.
```python
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib

print(numeric_features)
# ['car_year', 'km', 'horse_power', 'cyl_capacity']
print(categorical_features)
# ['make', 'model', 'trimlevel', 'fueltype', 'transmission', 'bodytype', 'color']

# Define transformers for numeric and categorical features
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('labelencoder', LabelEncoder())])

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Append the LightGBM model to the preprocessing pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', best_lgb_model)
])

# Fit the pipeline to training data
pipeline.fit(X_train, y_train)
```
The error I get when training is:

```
LabelEncoder.fit_transform() takes 2 positional arguments but 3 were given
```
**Replace `LabelEncoder` with `OneHotEncoder`:** As pointed out in the comments, `LabelEncoder` may only be used for encoding target variables (y-values), not features (X-values). The error you get is intended to prevent this misuse of `LabelEncoder`. For preprocessing categorical features, `OneHotEncoder` is more appropriate, especially within a pipeline setup where you need to handle multiple categorical features.
**Handling high-cardinality features with LightGBM:** If you are concerned about the performance implications of `OneHotEncoder` due to high cardinality in your categorical variables (as you mentioned, up to 100 different values in some columns), you might consider an alternative approach. LightGBM can handle categorical features directly, without explicit one-hot encoding, by specifying the categorical feature indices. You can remove the encoding step for categorical variables and pass the raw categorical data to LightGBM, specifying the categorical columns via the `categorical_feature` parameter of the LightGBM model. If your `DataFrame`'s categorical columns are already of dtype `category` (`pd.Categorical`), you can use `categorical_feature='auto'` in LightGBM and don't need to list the categorical columns at all:
```python
import numpy as np
import lightgbm as lgb
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector

# Convert all non-numeric columns in the DataFrame to category
cat_cols = X_train.select_dtypes(exclude=[np.number]).columns
X_train[cat_cols] = X_train[cat_cols].astype('category')

# Define numeric transformer
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# Preprocessor that selects numeric features automatically via make_column_selector;
# remainder='passthrough' keeps the categorical columns (the default remainder='drop'
# would silently discard them)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, make_column_selector(dtype_include=np.number))
    ],
    remainder='passthrough'
)
# Output a DataFrame so the category dtypes survive preprocessing
# (requires scikit-learn >= 1.2)
preprocessor.set_output(transform='pandas')

# Create the LightGBM model with automatic categorical feature handling
best_lgb_model = lgb.LGBMRegressor(categorical_feature='auto')

# Create the full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', best_lgb_model)
])

# Fit the model
pipeline.fit(X_train, y_train)
```
Note that in your specific example you don't need a pipeline at all. You can leave out the scaling step entirely: tree-based models like LightGBM do not require scaled features, since they split on thresholds rather than relying on distance calculations or gradients where scale matters.