I have already pre-cleaned the data, and below shows the format of the top five rows:
[IN] df.head()
[OUT] Year cleaned
0 1909 acquaint hous receiv follow letter clerk crown...
1 1909 ask secretari state war whether issu statement...
2 1909 i beg present petit sign upward motor car driv...
3 1909 i desir ask secretari state war second lieuten...
4 1909 ask secretari state war whether would introduc...
I have called train_test_split() as follows:
[IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
[Note*] `X_train` and `y_train` are now `pandas.core.series.Series` of shape (1785,) and `X_test` and `y_test` are also `pandas.core.series.Series` of shape (595,)
I have then vectorized the X training and testing data using the following TfidfVectorizer and fit/transform procedures:
[IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
X_train = v.fit_transform(X_train)
X_test = v.transform(X_test)
I'm now at the stage where I would normally apply a classifier, etc. (if this were a balanced dataset). However, I initialize imblearn's `SMOTE()` class (to perform over-sampling) in a pipeline...
[IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
smote_model = smote_pipeline.fit(X_train, y_train)
smote_prediction = smote_model.predict(X_test)
... but this results in:
[OUT] ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6
I've attempted to whittle down the number of neighbors, but to no avail; any tips or advice would be much appreciated. Thanks for reading.
------------------------------------------------------------------------------------------------------------------------------------
EDIT:
The dataset/dataframe (`df`) contains 2380 rows across two columns, as shown in `df.head()` above. `X_train` contains 1785 of these rows in the format of a list of strings (`df['cleaned']`) and `y_train` also contains 1785 rows in the format of strings (`df['Year']`).
Post-vectorization using `TfidfVectorizer()`: `X_train` and `X_test` are converted from `pandas.core.series.Series` of shape (1785,) and (595,) respectively, to `scipy.sparse.csr.csr_matrix` of shape (1785, 126459) and (595, 126459) respectively.
As for the number of classes: using `Counter()`, I've calculated that there are 199 classes (Years); each instance of a class is attached to one element of the aforementioned `df['cleaned']` data, which contains a list of strings extracted from a textual corpus.
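A minimal sketch of that count (assuming `df['Year']` holds the class labels, as in `df.head()` above):

```python
from collections import Counter

class_counts = Counter(df['Year'])  # samples per Year label
print(len(class_counts))            # -> 199 distinct classes
```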
The objective of this process is to automatically determine/guess the year, decade, or century (any degree of classification will do!) of input textual data based on the vocabulary present.
------------------------------------------------------------------------------------------------------------------------------------
ANSWER:
Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The error occurs because (a) the data are probably not perfectly balanced, so some classes have fewer than 6 samples, and (b) the number of neighbors searched internally is 6 (SMOTE's default `k_neighbors=5`, plus the sample itself).
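You can confirm this quickly (a minimal check, assuming `y_train` still holds the un-resampled training labels):

```python
from collections import Counter

class_counts = Counter(y_train)
too_small = {cls: n for cls, n in class_counts.items() if n < 6}
print(len(too_small), "classes have fewer than 6 training samples:", too_small)
```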
A few solutions to your problem:
1. Calculate the minimum number of samples (`n_samples`) among the 199 classes and set the `k_neighbors` parameter of the `SMOTE` class to a value less than or equal to `n_samples - 1` (the sampler searches for `k_neighbors + 1` neighbours within each class, the sample itself included).
2. Exclude the classes with `n_samples < k_neighbors + 1` from oversampling, using the `ratio` parameter of the `SMOTE` class (renamed `sampling_strategy` in recent imbalanced-learn releases).
3. Use the `RandomOverSampler` class, which does not have a similar restriction.
4. Combine solutions 2 and 3: create a pipeline that uses `SMOTE` and `RandomOverSampler` in a way that satisfies the condition `k_neighbors <= n_samples - 1` for the SMOTE'd classes and applies random oversampling when the condition is not satisfied.
A sketch of solutions 1, 3 and 4 follows this list.
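As a rough illustration, here is a minimal sketch of solutions 1, 3 and 4. It assumes a recent imbalanced-learn release (where the neighbour count is `k_neighbors` and `ratio` has become `sampling_strategy`), and `LogisticRegression` is just a stand-in for whatever classifier you are using:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.linear_model import LogisticRegression  # stand-in classifier

class_counts = Counter(y_train)
min_samples = min(class_counts.values())

# Solution 1: cap k_neighbors so that even the smallest class has enough
# same-class neighbours (SMOTE needs k_neighbors + 1 samples per class).
# Only viable if every class has at least 2 training samples.
k = max(1, min_samples - 1)
smote_pipeline = make_pipeline_imb(
    SMOTE(k_neighbors=k),
    LogisticRegression(random_state=42),
)

# Solution 3: RandomOverSampler duplicates existing samples instead of
# synthesising new ones, so it works even for single-sample classes.
ros_pipeline = make_pipeline_imb(
    RandomOverSampler(random_state=42),
    LogisticRegression(random_state=42),
)

# Solution 4: randomly duplicate the too-small classes up to k + 1 samples
# first, then let SMOTE oversample everything with a fixed k.
k = 5  # SMOTE's default k_neighbors
boost = {cls: k + 1 for cls, n in class_counts.items() if n < k + 1}
combined_pipeline = make_pipeline_imb(
    RandomOverSampler(sampling_strategy=boost, random_state=42),
    SMOTE(k_neighbors=k),
    LogisticRegression(random_state=42),
)
combined_model = combined_pipeline.fit(X_train, y_train)
combined_prediction = combined_model.predict(X_test)
```

Note that with the first pipeline, a Year with only one training sample still breaks SMOTE (there is no same-class neighbour to interpolate with); that is exactly the case the random-duplication fallback of solutions 3 and 4 handles.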