I'm working on text classification using two distinct datasets, with the aim of using one dataset for training and the other for testing. Please note I do not wish to merge the datasets, to prevent leakage (I think that's what it's called). The test dataset is much smaller (~1000 rows) than the training dataset (16k rows).
I'm using CountVectorizer, and as the two datasets have different vocabularies, the resulting matrices have different numbers of columns, which leads to an error at the prediction step:
ValueError: X has 55229 features, but DecisionTreeClassifier is expecting 387964 features as input.
I've been GPTing and Googling for some time and I'm getting mixed guidance.
Code snippets below:
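import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
# read dfs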
df_1 = pd.read_csv("data1.csv",header=0) # for training, has text, and class columns
df_2 = pd.read_csv("data2.csv",header=0) # for testing, has text, and class columns
# vectorise
CV1 = CountVectorizer(ngram_range=(1,3), stop_words="english").fit(df_1['text'])
x_train = CV1.transform(df_1['text'])
y_train = df_1['class']
CV2 = CountVectorizer(ngram_range=(1,3), stop_words="english").fit(df_2['text'])
x_test = CV2.transform(df_2['text'])
y_test = df_2['class']
## shapes of objects
## x_test (1589, 55229), y_test(1589,)
## x_train (16716, 387964), y_train(16716,)
# build classifier and predict
classifier = DecisionTreeClassifier(random_state=1234)
model = classifier.fit(x_train,y_train)
y_pred = model.predict(x_test)
# error ValueError: X has 55229 features, but DecisionTreeClassifier is expecting 387964 features as input.
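As with every preprocessing step, do not fit on the test set. You should have a single instance of CountVectorizer: call fit_transform on the training set, then transform on the test set. That way both matrices are built from the training vocabulary and have the same number of columns, which is what the classifier expects.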
In your case:
CV = CountVectorizer(ngram_range=(1,3), stop_words="english")
x_train = CV.fit_transform(df_1['text'])
y_train = df_1['class']
x_test = CV.transform(df_2['text'])
y_test = df_2['class']
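To check the idea end to end, here is a minimal sketch on toy data. The texts and labels are made up and only stand in for df_1['text'] and df_2['text']; the names train_texts, test_texts and pipeline are mine, not from the question. It shows that a vectorizer fitted once gives both matrices the same number of columns, and that make_pipeline is one way to bundle the vectorizer and the classifier so the test set is only ever transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for df_1 (training) and df_2 (testing).
train_texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project status meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["buy cheap watches today", "monday project review"]

# Learn the vocabulary from the training texts only.
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
x_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary; n-grams not seen in training are ignored.
x_test = vectorizer.transform(test_texts)

# Both matrices now have one column per training n-gram.
print(x_train.shape, x_test.shape)

classifier = DecisionTreeClassifier(random_state=1234)
classifier.fit(x_train, train_labels)
y_pred = classifier.predict(x_test)

# Alternative: a pipeline that fits the vectorizer inside fit()
# and only transforms inside predict().
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), stop_words="english"),
    DecisionTreeClassifier(random_state=1234),
)
pipeline.fit(train_texts, train_labels)
y_pred_pipeline = pipeline.predict(test_texts)
```

With the pipeline you pass raw text to both fit and predict, so there is no second vectorizer to fit on the test data by accident.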