I am working on a binary classification problem that classifies variable stars as either Heartbeat (HB) stars or eclipsing binary (ECL) stars from their light curves. When I run my code, recall increases to 1 while precision suddenly drops to 0.5.
But when I look at the per-epoch metrics during training, this doesn't seem to be happening.
Here are the relevant sections of my code:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
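# NOTE: df is assumed to be loaded earlier (not shown here), with at least the
# columns 'ID', 'HJD-2450000' and 'mag' from the light-curve files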
# Convert columns to numeric, coercing errors to NaN
df['HJD-2450000'] = pd.to_numeric(df['HJD-2450000'], errors='coerce')
df['mag'] = pd.to_numeric(df['mag'], errors='coerce')
# Handle any NaN values if they exist
df.dropna(inplace=True)
# Create binary labels: 0 = Heartbeat (ID starts with 'HB'), 1 = ECL
df['Type'] = df['ID'].apply(lambda x: 0 if x.startswith('HB') else 1)
# Feature scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['HJD-2450000', 'mag']])
# Add polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(scaled_features)
# Update DataFrame with new features
df_poly = pd.DataFrame(poly_features, columns=poly.get_feature_names_out())
df_poly['Type'] = df['Type'].to_numpy()  # use raw values: after dropna, df's index no longer lines up with df_poly's fresh RangeIndex
# Separate majority and minority classes
X = df_poly.drop('Type', axis=1)
y = df_poly['Type']
# Handle class imbalance with SMOTE (note: applied before the split below, so the test set ends up balanced as well)
smote = SMOTE(random_state=29130)
X_resampled, y_resampled = smote.fit_resample(X, y)
# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=29130)
# Define model inputs
inputs = tf.keras.layers.Input(shape=(X_train.shape[1],), dtype=tf.float32, name='features')
# Model architecture with regularization
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)
x = tf.keras.layers.Dropout(0.3)(x)
dense_output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
def create_model(my_inputs, my_outputs, my_learning_rate):
    model = tf.keras.Model(inputs=my_inputs, outputs=my_outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=my_learning_rate),
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC(name='auc'), tf.keras.metrics.BinaryAccuracy(name='BA')])
    return model
learning_rate = 0.001 # Use Adam optimizer with a lower learning rate
my_model = create_model(inputs, dense_output, learning_rate)
# Train the model with early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
def train_model(model, features, labels, epochs, batch_size):
    history = model.fit(x=features, y=labels, batch_size=batch_size,
                        epochs=epochs, validation_split=0.2, shuffle=True, callbacks=[early_stopping])
    epochs = history.epoch
    hist = pd.DataFrame(history.history)
    ba = hist["BA"]
    return epochs, ba
epochs, ba = train_model(my_model, X_train, y_train, epochs=50, batch_size=64)
# Plotting the accuracy curve
def plot_accuracy_curve(epochs, ba):
    plt.figure()
    plt.xlabel("Epoch")
    plt.ylabel("Binary Accuracy")
    plt.plot(epochs, ba, label="Binary Accuracy")
    plt.legend()
    plt.ylim([ba.min() * 0.94, ba.max() * 1.05])
    plt.show()

plot_accuracy_curve(epochs, ba)
# Evaluate the model and plot the precision-recall curve
def plot_precision_recall_curve(model, features, labels):
    # Predict probabilities
    y_scores = model.predict(features).ravel()
    # Calculate precision and recall at different thresholds
    precision, recall, thresholds = precision_recall_curve(labels, y_scores)
    # Plot the precision-recall curve
    display = PrecisionRecallDisplay(precision=precision, recall=recall)
    display.plot()
    plt.title("Precision-Recall Curve")
    plt.show()

plot_precision_recall_curve(my_model, X_test, y_test)
# Evaluate the model
evaluation = my_model.evaluate(x=X_test, y=y_test, batch_size=64)
print(f"Model AUC: {evaluation[1]}")
I am using the time value and the magnitude (luminosity) value as my features to observe the light curve as the variable star changes over time. Is there something I'm missing? What I think is happening is that, since this is an imbalanced data set (ECL stars make up 90% of it), the model is just classifying everything as an ECL star. How do I prevent this from happening? Or, more specifically, is there something wrong with my features? Do I have to take a cross product (an interaction term) of time and magnitude so the model can capture their relation?
I have tried dropping some of the ECL stars' values (undersampling), but the precision keeps dropping to 0.5.
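One alternative I have been looking at is letting Keras weight the loss instead of oversampling. A minimal sketch of that idea (compute_class_weight is from scikit-learn, and y here means the original imbalanced labels, before SMOTE):

from sklearn.utils.class_weight import compute_class_weight

# 'balanced' gives each class a weight inversely proportional to its frequency,
# so with ~90% ECL the weights come out to roughly {0: 5.0, 1: 0.56}
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
weight_dict = dict(enumerate(weights))

# Passed to fit, misclassified HB stars then cost about 9x more than ECL stars:
# my_model.fit(X_train, y_train, epochs=50, batch_size=64, class_weight=weight_dict)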
Your precision-recall curve actually shows a pretty good result! The sudden drop in precision as recall approaches 1 should not be mistaken for poor model performance. The scikit-learn documentation on precision-recall explains how to interpret the curve.
The curve shows the performance of my_model after training has finished (i.e. after the last epoch run inside train_model), so it has nothing to do with the per-epoch loss and accuracy values in the training history. The precision-recall curve plots the different precision-recall values your model attains on one dataset (in this case X_test and y_test) across a range of classification thresholds.
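To make the threshold part concrete, here is a small sketch (reusing the names from your code, with y_scores being my_model.predict(X_test).ravel()) that reproduces a single point of the curve for one fixed threshold:

from sklearn.metrics import precision_score, recall_score

y_scores = my_model.predict(X_test).ravel()
threshold = 0.5                                # one fixed cutoff; the curve sweeps many of these
y_pred = (y_scores >= threshold).astype(int)   # scores above the cutoff become class 1

# This (recall, precision) pair is one point on the plotted curve
print(recall_score(y_test, y_pred), precision_score(y_test, y_pred))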
Suppose your test dataset consists of 50 positive and 50 negative samples. If the threshold is -infinity, every score the model produces on X_test is greater than the threshold, so every sample is classified as positive: TP = 50, FP = 50, TN = 0, FN = 0. This corresponds to the point (1, 0.5) on the curve, since recall = TP/(TP+FN) = 50/50 = 1 and precision = TP/(TP+FP) = 50/100 = 0.5. Repeat this for other threshold values and you get the whole precision-recall curve. Note that this is exactly your situation: after SMOTE your test set is roughly half positive, so 0.5 is simply the precision of a threshold that classifies everything as positive. The model is said to perform better as the area under the curve (AUC) grows (approaches 1), which is what your curve is showing.
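You can sanity-check the arithmetic with a standalone toy example (made-up scores, not your data):

import numpy as np
from sklearn.metrics import precision_recall_curve

# 50 positives and 50 negatives, with scores that overlap a little
rng = np.random.default_rng(0)
y_true = np.array([1] * 50 + [0] * 50)
y_scores = np.concatenate([rng.uniform(0.4, 1.0, 50),   # positives score high
                           rng.uniform(0.0, 0.6, 50)])  # negatives score low

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# At the lowest threshold everything is predicted positive: recall = 1, precision = 0.5
print(recall[0], precision[0])  # -> 1.0 0.5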