python, machine-learning, scikit-learn

Scikit-Learn Classifier model returns all zeroes


I'm trying to train a RandomForestClassifier model, but its predictions are all zeroes and I can't understand why. The dataset is huge (around 750,000 rows), so I'm a bit lost. Here is the code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("train.csv")
X_train = df.iloc[:, 1:-1].values
y_train = df.iloc[:, [-1]].values

df = pd.read_csv("test.csv")
X_test = df.iloc[:, 1:].values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy = "most_frequent")
imputer.fit(X_train[:, :])
X_train[:, :] = imputer.transform(X_train[:, :])
X_test[:, :] = imputer.transform(X_test[:, :])

int_features = []
categorical_features = []
for i in range(len(X_train[0])) : 
    if type(X_train[0][i]) == int or type(X_train[0][i]) == float : 
        int_features.append(i)
    elif type(X_train[0][i]) == str : 
        categorical_features.append(i)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct_x = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
X_train = np.array(ct_x.fit_transform(X_train))
X_test = np.array(ct_x.transform(X_test))

ct_y = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
y_train = np.array(ct_y.fit_transform(y_train))

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, int_features] = sc.fit_transform(X_train[:, int_features])
X_test[:, int_features] = sc.transform(X_test[:, int_features])

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=500, max_depth=25, random_state=42)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

#print(y_pred)
# Access the OneHotEncoder
ohe = ct_y.named_transformers_['encoder']

# Apply inverse_transform
inverse_transformed_data = ohe.inverse_transform(y_pred)

#print(inverse_transformed_data)

Basically, the model has to predict which fertilizer to use based on the data (soil quality, etc.). I one-hot encoded y_train because it contains all the fertilizer names the model has to predict. However, y_pred is all zeroes, and I don't know why!

Thank you for any help or advice you can give.

EDIT: Here is a small sample of the data:

train.csv:

id,Temparature,Humidity,Moisture,Soil Type,Crop Type,Nitrogen,Potassium,Phosphorous,Fertilizer Name
0,37,70,36,Clayey,Sugarcane,36,4,5,28-28
1,27,69,65,Sandy,Millets,30,6,18,28-28
2,29,63,32,Sandy,Millets,24,12,16,17-17-17
3,35,62,54,Sandy,Barley,39,12,4,10-26-26
4,35,58,43,Red,Paddy,37,2,16,DAP
5,30,59,29,Red,Pulses,10,0,9,20-20
6,27,62,53,Sandy,Paddy,26,15,22,28-28
7,36,62,44,Red,Pulses,30,12,35,14-35-14
8,36,51,32,Loamy,Tobacco,19,17,29,17-17-17
9,28,50,35,Red,Tobacco,25,12,16,20-20

test.csv:

id,Temparature,Humidity,Moisture,Soil Type,Crop Type,Nitrogen,Potassium,Phosphorous
750000,31,70,52,Sandy,Wheat,34,11,24
750001,27,62,45,Red,Sugarcane,30,14,15
750002,28,72,28,Clayey,Ground Nuts,14,15,4
750003,37,53,57,Black,Ground Nuts,18,17,36
750004,31,55,32,Red,Pulses,13,19,14
750005,35,63,34,Black,Millets,36,3,2
750006,38,50,56,Clayey,Sugarcane,32,6,31
750007,25,55,44,Black,Barley,32,9,32
750008,29,56,60,Red,Pulses,26,5,13
750009,25,63,40,Loamy,Sugarcane,9,5,41

Solution

  • Short answer: you should not one-hot encode the target of your Random Forest classifier.

    Detailed answer: the scikit-learn documentation for the RandomForestClassifier.fit method says that the expected y is:

    The target values (class labels in classification, real numbers in regression).

    So instead of one-hot encoding y_train before training, pass it directly as the y argument of fit. When y is a one-hot matrix, scikit-learn treats it as a multi-output problem and predicts each 0/1 column independently; since each fertilizer makes up only a small fraction of the rows, every column is overwhelmingly 0, so the per-column prediction defaults to 0, and that is exactly the all-zeroes output you saw. The minimal sketch below illustrates the difference.
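
    Here is a minimal, self-contained sketch of the difference, using toy random data (not your dataset; the class names and sizes are made up). The exact output varies, but on unseen samples the one-hot version typically returns rows of all zeroes:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                    # toy features
    y = rng.choice(["A", "B", "C", "D"], size=200)   # 4 string classes
    X_new = rng.normal(size=(5, 3))                  # unseen samples

    # One-hot target: fit() sees a (200, 4) indicator matrix and treats it
    # as 4 separate 0/1 outputs. Each column is ~75% zeros, so the
    # per-column prediction is usually 0 -> whole rows of zeroes.
    y_ohe = OneHotEncoder().fit_transform(y.reshape(-1, 1)).toarray()
    clf_ohe = RandomForestClassifier(random_state=0).fit(X, y_ohe)
    print(clf_ohe.predict(X_new))   # mostly (or entirely) zero rows

    # Raw labels: fit() sees a single multiclass target and predict()
    # returns the class names themselves.
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    print(clf.predict(X_new))       # e.g. ['B' 'A' 'D' 'C' 'A']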

    Below is the code supporting my answer. I also extended your sample training data so that every category in the test set appears at least once in the training set; the new train.csv is shown after the code, followed by a small script you can run to verify that coverage.

    Note: I rewrote the for loop that collects the feature indices to make it more "pythonic", loaded y_train as a 1-D array (a 2-D column vector makes scikit-learn emit a DataConversionWarning), and fixed the scaling step, since the ColumnTransformer moves the numeric passthrough columns after the one-hot encoded ones. Finally, I added # %% markers so the code can be run cell by cell; feel free to remove them, but if you do, it is more readable to gather all the imports at the very beginning of the file.

    # %%
    import pandas as pd
    import numpy as np
    
    df = pd.read_csv("train.csv")
    X_train = df.iloc[:, 1:-1].values
    y_train = df.iloc[:, -1].values  # 1-D array of labels, as fit() expects
    
    # y_train remains ["28-28", "28-28", ..., "30-30"] in the following code
    
    df = pd.read_csv("test.csv")
    X_test = df.iloc[:, 1:].values
    
    # %%
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
    imputer.fit(X_train[:, :])
    X_train[:, :] = imputer.transform(X_train[:, :])
    X_test[:, :] = imputer.transform(X_test[:, :])
    
    int_features = []
    categorical_features = []
    for i, x_i in enumerate(X_train[0]):  # more concise than range(len(...))
        if isinstance(x_i, (int, float)):
            int_features.append(i)
        elif isinstance(x_i, str):
            categorical_features.append(i)
    
    # %%
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    ct_x = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
    X_train = np.array(ct_x.fit_transform(X_train))
    X_test = np.array(ct_x.transform(X_test))
    
    # %%
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    
    # After the ColumnTransformer, the one-hot encoded columns come first and
    # the passthrough (numeric) columns are appended at the end, so the old
    # int_features indices no longer point at the numeric columns.
    # (Scaling is not required for tree-based models, but it does no harm.)
    n_num = len(int_features)
    X_train[:, -n_num:] = sc.fit_transform(X_train[:, -n_num:])
    X_test[:, -n_num:] = sc.transform(X_test[:, -n_num:])
    
    # %%
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=500, max_depth=25, random_state=42)
    classifier.fit(X_train, y_train)
    
    y_pred = classifier.predict(X_test)
    
    print(f"y_pred: {y_pred}")
    
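    Since test.csv has no labels, a quick way to convince yourself that the fix works is to hold out part of train.csv and score on it. This is only a rough sanity check (the 20% split and the reuse of the already-preprocessed X_train/y_train are my assumptions, not part of your pipeline):

    # %%
    from sklearn.model_selection import train_test_split
    
    # Hold out 20% of the preprocessed training data as a sanity check.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42)
    
    check = RandomForestClassifier(n_estimators=500, max_depth=25, random_state=42)
    check.fit(X_tr, y_tr)
    print(check.predict(X_val)[:10])   # fertilizer names such as '28-28', not zeroes
    print(check.score(X_val, y_val))   # mean accuracy on the held-out rows
    
    Note that the scaler above was fit on all of X_train, so this check leaks a little information; for a real evaluation, split first and fit the preprocessing on the training part only.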

    The new train.csv:

    Note: I kept your misspelled column name Temparature so the code runs on your files as-is, but it could be corrected to Temperature in your final code.

    id,Temparature,Humidity,Moisture,Soil Type,Crop Type,Nitrogen,Potassium,Phosphorous,Fertilizer Name
    0,37,70,36,Clayey,Sugarcane,36,4,5,28-28
    1,27,69,65,Sandy,Millets,30,6,18,28-28
    2,29,63,32,Sandy,Millets,24,12,16,17-17-17
    3,35,62,54,Sandy,Barley,39,12,4,10-26-26
    4,35,58,43,Red,Paddy,37,2,16,DAP
    5,30,59,29,Red,Pulses,10,0,9,20-20
    6,27,62,53,Sandy,Paddy,26,15,22,28-28
    7,36,62,44,Red,Pulses,30,12,35,14-35-14
    8,36,51,32,Loamy,Tobacco,19,17,29,17-17-17
    9,28,50,35,Red,Tobacco,25,12,16,20-20
    10,30,45,35,Black,Ground Nuts,20,2,19,28-28
    11,25,69,42,Black,Wheat,25,12,26,30-30
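
    And here is a small check (a sketch, assuming the two CSV files above) that every Soil Type and Crop Type value in test.csv also appears in train.csv; OneHotEncoder raises an error by default when transform meets a category it did not see during fit:

    import pandas as pd
    
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    
    for col in ["Soil Type", "Crop Type"]:
        unseen = set(test[col]) - set(train[col])
        print(f"{col}: categories unseen in train -> {unseen or 'none'}")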