I want to understand why this code produces the following results:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
import lightgbm as lgb
# Load Titanic dataset
titanic_data = pd.read_csv('titanic.csv') # Assuming the dataset is stored in 'titanic.csv'
# Select specific features
selected_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']
titanic_data = titanic_data[selected_features]
# Convert categorical features to numerical using one-hot encoding
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], drop_first=True)
# Extract features and target variable
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']
# Split dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an initial LightGBM model
initial_model = lgb.train({}, lgb.Dataset(X_train, label=y_train), 100)
# Make predictions and calculate F1 score
y_pred_initial = initial_model.predict(X_valid)
f1_initial = f1_score(y_valid, (y_pred_initial > 0.5).astype(int))
# Display F1 Score and Confusion Matrix
print(f"Initial F1 Score: {f1_initial}")
print("Confusion Matrix:")
print(confusion_matrix(y_valid, (y_pred_initial > 0.5).astype(int)))
When I print y_pred_initial, I get an array of probabilities that contains negative values and values greater than 1:
Example:
array([ 0.01546079, 0.17557856, 0.22971758, 1.23292351, 0.60531331,
1.04524314, 0.7637124 , 0.0458202 , 0.63044718, 1.02387605,
0.6441506 , 0.15202829, 0.06836975, 0.12113314, 0.19732339,
0.78233429, 0.37779053, 0.75745862, 0.29348834, -0.08458378,
-0.07173513, 0.73006681, 0.38585976, 0.09324021, -0.02912595,
-0.10779946, 0.22953974, 0.24480956])
Why are the probabilities here not between 0 and 1?
You need to set the objective function explicitly, especially for a classifier. By default, LightGBM trains a regression model that minimizes squared error (see the docs), so its predictions are unbounded values rather than probabilities:
params = {
"objective": "binary",
}
initial_model = lgb.train(params, lgb.Dataset(X_train, label=y_train), 100)
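To see why the binary objective keeps predictions inside the unit interval, here is a minimal sketch using only the standard library. It assumes, as the binary objective does by default, that the model's unbounded raw score is passed through a logistic sigmoid before being returned by predict. The raw_scores values below are made up for illustration and do not come from the Titanic model.

```python
import math

def sigmoid(raw_score: float) -> float:
    """Map an unbounded raw score to the open interval (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if raw_score >= 0:
        return 1.0 / (1.0 + math.exp(-raw_score))
    z = math.exp(raw_score)
    return z / (1.0 + z)

# Hypothetical raw scores: a regression objective would return values like
# these directly, which is why some fall below 0 or above 1.
raw_scores = [-3.0, -0.1, 0.0, 1.2, 4.0]

# With a logistic link, every score becomes a valid probability.
probabilities = [sigmoid(s) for s in raw_scores]

# Same 0.5 threshold as in the question's code.
labels = [int(p > 0.5) for p in probabilities]

print(probabilities)
print(labels)
```

With the objective set to "binary", the 0.5 threshold used in the question is applied to genuine probabilities instead of regression outputs.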