python machine-learning text-classification snorkel

Snorkel: Can I use different feature sets for generating labeling functions vs. training a classifier?


I have one set of features for building labeling functions (set A) and another set of features for training a sklearn classifier (set B).

The generative model will output a set of probabilistic labels which I can use to train my classifier.

Do I need to add the features I used for the labeling functions (set A) to my classifier features (set B), or should I just use the generated labels to train my classifier?

I was referencing the Snorkel spam tutorial, and I did not see them use the labeling-function features to train the new classifier.

As seen in cell 47, featurization is done entirely with a CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())

X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

And then it goes straight to fitting a Keras model:

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])

keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
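Since `probs_train_filtered` holds soft (probabilistic) labels, the end model here is minimizing cross-entropy against probabilities rather than hard classes. Below is a minimal stdlib-only sketch of that idea, with made-up data and names, standing in for the Keras fit: for a logistic unit, the gradient of cross-entropy with a soft target `p` with respect to the logit is simply `q - p`, so soft labels drop straight into ordinary SGD.

```python
import math

# Hypothetical sketch: logistic regression trained on *soft* targets,
# the role Keras plays in the tutorial.  Data and names are made up.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, probs, lr=0.5, epochs=200):
    """SGD on cross-entropy against probabilistic labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, p in zip(X, probs):
            q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = q - p                     # soft-label gradient w.r.t. logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Set-B features (e.g. n-gram counts); probs come from the label model.
X = [[1.0, 0.0], [0.0, 1.0]]
probs = [0.9, 0.1]                        # probabilistic labels
w, b = fit(X, probs)
q0 = sigmoid(w[0] + b)                    # should approach 0.9
q1 = sigmoid(w[1] + b)                    # should approach 0.1
```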

Solution

  • I asked the same question on the Snorkel GitHub page, and this is the response:

    You do not need to add the features (set A) that you used for LFs into the classifier features. To prevent the end model from simply overfitting to the labeling functions, it is better if the features for the LFs and the end model (set A and set B) are as different as possible.

    https://github.com/snorkel-team/snorkel-tutorials/issues/193#issuecomment-576450705
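That advice can be sketched end to end. In this hypothetical, stdlib-only example, the labeling functions look only at hand-crafted set-A flags, plain vote averaging stands in for Snorkel's LabelModel, and the end model would be trained only on set-B features (the raw text), so the two feature sets never mix:

```python
# Hypothetical end-to-end sketch; names and data are made up, and
# vote averaging is a stand-in for Snorkel's actual LabelModel.

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_has_link(record):          # looks only at a set-A flag
    return SPAM if record["has_link"] else ABSTAIN

def lf_long_text(record):         # looks only at another set-A flag
    return HAM if record["n_words"] > 5 else ABSTAIN

lfs = [lf_has_link, lf_long_text]

def probabilistic_label(record):
    """P(SPAM) = fraction of non-abstaining LFs voting SPAM."""
    votes = [v for v in (lf(record) for lf in lfs) if v != ABSTAIN]
    return 0.5 if not votes else sum(v == SPAM for v in votes) / len(votes)

train = [
    {"has_link": True,  "n_words": 3, "text": "claim your prize"},
    {"has_link": False, "n_words": 9, "text": "see you at the meeting tomorrow"},
]

probs = [probabilistic_label(r) for r in train]   # [1.0, 0.0]
# The end model is then fit on set-B features only (e.g. n-grams of
# r["text"]), with `probs` as its targets -- set A is never reused.
```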