What am I working on?
I am working on a machine learning project that predicts the price of electric vehicles in the different states of the USA. My goal is to solidify my practical skills. I have completed every step of the project: performing one-hot encoding, training the model, and running the Flask app on localhost. On localhost, I filled out the form with the following values and then clicked the submit button:
County: Jefferson
City: PORT TOWNSEND
ZIP Code: 98368
Model Year: 2012
Make: NISSAN
Model: LEAF
Electric Vehicle Type: Battery Electric Vehicle (BEV)
CAFV Eligibility: Clean Alternative Fuel Vehicle Eligible
Legislative District: 24
What issue am I facing?
After submitting the form, I get this error:
ValueError: Found unknown categories ['98368'] in column 2 during transform
Traceback (most recent call last)
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\flask\app.py", line 1498, in __call__
    return self.wsgi_app(environ, start_response)
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\flask\app.py", line 1476, in wsgi_app
    response = self.handle_exception(e)
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\flask\app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\flask\app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\flask\app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\flask\app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
  File "G:\Machine_Learning_Projects\austin\electric_vehicle_price_prediction_2\app\routes.py", line 38, in predict
    price = predict_price(features)
  File "G:\Machine_Learning_Projects\austin\electric_vehicle_price_prediction_2\app\model.py", line 29, in predict_price
    transformed_features = encoder.transform(features_df)
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\sklearn\utils_set_output.py", line 157, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\sklearn\preprocessing_encoders.py", line 1027, in transform
    X_int, X_mask = self._transform(
  File "C:\Users\austin.conda\envs\electric_vehicle_price_prediction_2\lib\site-packages\sklearn\preprocessing_encoders.py", line 200, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['98368'] in column 2 during transform
What did I try?
I tried using the following code:
Code of the routes.py file inside the app folder:
from flask import render_template, request, jsonify
from app import app
from app.model import predict_price
from jinja2 import Environment, FileSystemLoader, PackageLoader, select_autoescape

@app.route('/')
def index():
    env = Environment(
        loader=PackageLoader("app"),
        autoescape=select_autoescape()
    )
    template = env.get_template("index.html")
    return render_template(template)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.form.to_dict()
    # Convert the form data into the correct format for prediction
    features = [
        data['county'],
        data['city'],
        data['zip_code'],
        data['model_year'],
        data['make'],
        data['model'],
        data['ev_type'],
        data['cafv_eligibility'],
        data['legislative_district']
    ]
    # Get the prediction result
    price = predict_price(features)
    return jsonify({'predicted_price': price})
Code of the model.py file inside the app folder:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
import joblib
from flask import Flask, render_template
from jinja2 import Environment, FileSystemLoader, PackageLoader, select_autoescape

env = Environment(
    loader=PackageLoader("app"),
    autoescape=select_autoescape()
)
model = joblib.load('model/ev_price_model.pkl')

def predict_price(features):
    encoder = joblib.load('model/encoder.pkl')  # Load encoder if needed
    features_df = pd.DataFrame([features], columns=['County', 'City', 'ZIP Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Legislative District'])
    # Apply encoding, scaling, etc., if necessary
    transformed_features = encoder.transform(features_df)
    # Make the prediction
    price = model.predict(transformed_features)
    return price[0]  # Assuming it returns a single value
What is the link to my GitHub repository?
Here is the link to my repo:
https://github.com/SteveAustin583/electric-vehicle-price-prediction
What was I expecting?
I was expecting to get the prediction result without any issue, because I have already performed one-hot encoding.
Can you help me fix this issue?
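The encoder is being applied to every column of the input, including numeric ones such as ZIP Code and Model Year, which is probably what causes the error. Flask form values arrive as strings, and a OneHotEncoder with the default handle_unknown='error' raises this ValueError for any value it did not see during fit.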
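To illustrate, here is a minimal sketch that reproduces the same kind of error. It assumes the encoder was fitted on integer ZIP codes, which I can't confirm from the question; the toy data and the variable names are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data (made up): ZIP codes stored as integers.
train = pd.DataFrame({"ZIP Code": [98368, 98101], "Make": ["NISSAN", "TESLA"]})

# Form values arrive as strings, so the ZIP code is '98368', not 98368.
form_row = pd.DataFrame({"ZIP Code": ["98368"], "Make": ["NISSAN"]})

strict = OneHotEncoder()  # default handle_unknown='error'
strict.fit(train)

try:
    strict.transform(form_row)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Casting the field back to the training dtype makes the value known again.
fixed_row = form_row.assign(**{"ZIP Code": form_row["ZIP Code"].astype(int)})
encoded = strict.transform(fixed_row).toarray()
print(encoded)
```

If the ZIP code was simply absent from the training data, the cast alone won't help, and handle_unknown='ignore' is the safer option.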
I think it's better to define separate transformers for the numerical and categorical columns:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical_cols and categorical_cols are lists of column names from your dataset
num_transformer = StandardScaler()
# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising
cat_transformer = OneHotEncoder(handle_unknown='ignore')

# create a preprocessor using ColumnTransformer from sklearn
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, numerical_cols),
        ('cat', cat_transformer, categorical_cols),
    ]
)

# combine into a single pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])
Then split the data and fit the pipeline:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
Modify the inference similarly. Let me know if this helps with your problem.
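For the inference side, something along these lines might work. This is only a sketch under assumptions: the training rows, the prices, the reduced column set, and the dict-based predict_price signature are all made up for illustration and are not taken from your repo.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows; the real project would load its own dataset.
train = pd.DataFrame({
    "Model Year": [2012, 2018, 2020, 2015],
    "ZIP Code": ["98368", "98101", "98052", "98101"],
    "Make": ["NISSAN", "TESLA", "TESLA", "NISSAN"],
})
prices = [9000.0, 35000.0, 42000.0, 12000.0]

numerical_cols = ["Model Year"]
categorical_cols = ["ZIP Code", "Make"]

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=42)),
])
pipeline.fit(train, prices)


def predict_price(form_data):
    """Build a one-row DataFrame from form strings and predict with the pipeline."""
    row = pd.DataFrame([{
        "Model Year": int(form_data["model_year"]),  # cast numeric fields
        "ZIP Code": form_data["zip_code"],           # keep categorical fields as str
        "Make": form_data["make"],
    }])
    # Convert the NumPy scalar to a plain float so it can be serialized to JSON.
    return float(pipeline.predict(row)[0])


# A ZIP code never seen in training no longer raises an error.
price = predict_price({"model_year": "2012", "zip_code": "99999", "make": "NISSAN"})
print(price)
```

Because the preprocessing lives inside the pipeline, you can save the fitted pipeline with joblib.dump and load that single object in model.py, instead of loading the encoder and the model separately.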
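Edit: Another, simpler way to separate the x data and one-hot encode the categorical part is:
ohe = OneHotEncoder(handle_unknown='ignore')
cat_df = df.select_dtypes(include=['object'])
# fit_transform expects a 2-D input, so pass the whole categorical frame at once
x_categorical = pd.DataFrame(ohe.fit_transform(cat_df).toarray(), index=df.index)
x_numerical = df.select_dtypes(exclude=['object'])
Then combine
x = pd.concat([x_numerical, x_categorical], axis=1).values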
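With this manual approach, the same fitted encoder has to be reused at prediction time with transform, not fit_transform, so the columns line up with what the model was trained on. A rough sketch, with a made-up two-column frame and explicit column lists (num_cols and cat_cols are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training frame for illustration.
df = pd.DataFrame({
    "Model Year": [2012, 2018],
    "Make": ["NISSAN", "TESLA"],
})
num_cols = ["Model Year"]
cat_cols = ["Make"]

# Fit the encoder once, on the training data only.
ohe = OneHotEncoder(handle_unknown="ignore")
x_train = np.hstack([
    df[num_cols].to_numpy(),
    ohe.fit_transform(df[cat_cols]).toarray(),
])

# At prediction time, call transform (not fit_transform) on the new row.
new_row = pd.DataFrame({"Model Year": [2012], "Make": ["FORD"]})
x_new = np.hstack([
    new_row[num_cols].to_numpy(),
    ohe.transform(new_row[cat_cols]).toarray(),
])
print(x_new)  # the unseen make 'FORD' is encoded as all zeros
```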