python scikit-learn nlp text-classification

How to use different datasets for training and testing in text classification while avoiding a mismatch in the number of features?


I'm working on text classification using two distinct datasets, with the aim of using one dataset for training and the other for testing. Please note that I do not wish to merge the datasets, to prevent leakage (I think that's what it's called). The test dataset is much smaller (~1,000 rows) than the training dataset (~16k rows).

I'm using CountVectorizer, and because the two datasets have different vocabularies, the resulting matrices have different numbers of columns, which leads to an error at the prediction step:

ValueError: X has 55229 features, but DecisionTreeClassifier is expecting 387964 features as input.

I've been GPTing and Googling for some time and I'm getting mixed guidance, e.g.:

  1. add zero-filled columns to the smaller x_test
  2. use scikit-learn pipeline

Code snippets below:

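import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# read dfs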
df_1 = pd.read_csv("data1.csv",header=0) # for training, has text, and class columns
df_2 = pd.read_csv("data2.csv",header=0) # for testing,  has text, and class columns

# vectorise
CV1 = CountVectorizer(ngram_range=(1,3), stop_words="english").fit(df_1['text']) 
x_train = CV1.transform(df_1['text'])
y_train = df_1['class']

CV2 = CountVectorizer(ngram_range=(1,3), stop_words="english").fit(df_2['text']) 
x_test = CV2.transform(df_2['text'])
y_test = df_2['class']

## shapes of objects
## x_test (1589, 55229), y_test(1589,)
## x_train (16716, 387964), y_train(16716,)

# build classifier and predict
classifier = DecisionTreeClassifier(random_state=1234)
model = classifier.fit(x_train,y_train)
y_pred = model.predict(x_test)

# error ValueError: X has 55229 features, but DecisionTreeClassifier is expecting 387964 features as input.

Solution

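  • As with every preprocessing step, do not fit on the test set. Use a single CountVectorizer instance: call fit_transform on the training set, then only transform on the test set. transform maps the test text onto the vocabulary learned from the training data and ignores any n-grams it has not seen, so x_test ends up with the same number of columns as x_train.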

    In your case:

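    # one vectoriser, fitted on the training text only
    CV = CountVectorizer(ngram_range=(1,3), stop_words="english")
    x_train = CV.fit_transform(df_1['text'])
    y_train = df_1['class']
    
    # reuse the fitted vocabulary; n-grams unseen in training are ignored
    x_test = CV.transform(df_2['text'])
    y_test = df_2['class']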
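
Since the question mentions pipelines: the same idea can be sketched with a scikit-learn pipeline, which fits the vectoriser on the training text and reuses that vocabulary at prediction time. This is only a minimal illustration; the toy texts and labels below are made up to stand in for df_1 and df_2, and the variable names are assumptions rather than part of your code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-ins for df_1['text'], df_1['class'] and df_2['text']
train_texts = [
    "cheap pills online",
    "meeting at noon",
    "win cash now",
    "lunch with the team",
]
train_labels = ["spam", "ham", "spam", "ham"]
test_texts = ["win cheap cash prizes", "team meeting tomorrow"]

# fit() learns the vocabulary from the training text only;
# predict() vectorises new text with that same vocabulary.
pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), stop_words="english"),
    DecisionTreeClassifier(random_state=1234),
)
pipe.fit(train_texts, train_labels)
y_pred = pipe.predict(test_texts)

# Sanity check: the test matrix has one column per training-vocabulary entry
vectorizer = pipe.named_steps["countvectorizer"]
n_train_features = len(vectorizer.vocabulary_)
x_test = vectorizer.transform(test_texts)
print(x_test.shape, n_train_features)
```

With a pipeline there is no separate vectoriser to accidentally refit on the test set, and words that occur only in the test texts (here "prizes" and "tomorrow") simply do not get a column.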