SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors

I have already pre-cleaned the data; the first five rows look like this:

     [IN] df.head()

    [OUT]   Year    cleaned
         0  1909    acquaint hous receiv follow letter clerk crown...
         1  1909    ask secretari state war whether issu statement...
         2  1909    i beg present petit sign upward motor car driv...
         3  1909    i desir ask secretari state war second lieuten...
         4  1909    ask secretari state war whether would introduc...

I have called train_test_split() as follows:

     [IN] X_train, X_test, y_train, y_test = train_test_split(df['cleaned'], df['Year'], random_state=2)
   [Note*] `X_train` and `y_train` are now pandas.core.series.Series objects of shape (1785,), and `X_test` and `y_test` are pandas.core.series.Series objects of shape (595,).

I have then vectorized the X training and testing data using the following TfidfVectorizer and fit/transform procedures:

     [IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1, 1), sublinear_tf=True)
          X_train = v.fit_transform(X_train)
          X_test = v.transform(X_test)

I'm now at the stage where I would normally apply a classifier, etc. (if this were a balanced set of data). Because it is not balanced, I instead initialize imblearn's SMOTE() class (to perform over-sampling)...

     [IN] smote_pipeline = make_pipeline_imb(SMOTE(), classifier(random_state=42))
          smote_model = smote_pipeline.fit(X_train, y_train)
          smote_prediction = smote_model.predict(X_test)

... but this results in:

     [OUT] ValueError: Expected n_neighbors <= n_samples, but n_samples = 5, n_neighbors = 6

I've attempted to reduce the number of neighbours, but to no avail. Any tips or advice would be much appreciated. Thanks for reading.

------------------------------------------------------------------------------------------------------------------------------------

EDIT:

Full Traceback

The dataset/dataframe (df) contains 2380 rows across two columns, as shown in df.head() above. X_train contains 1785 of these rows in the format of a list of strings (df['cleaned']) and y_train also contains 1785 rows in the format of strings (df['Year']).

Post-vectorization using TfidfVectorizer(): X_train and X_test are converted from pandas.core.series.Series of shape '(1785,)' and '(595,)' respectively, to scipy.sparse.csr.csr_matrix of shape '(1785, 126459)' and '(595, 126459)' respectively.

As for the number of classes: using Counter(), I've calculated that there are 199 classes (Years). Each sample pairs one class label with one element of the aforementioned df['cleaned'] data, which contains the strings extracted from a textual corpus.

The objective of this process is to automatically determine/guess the year, decade or century (any degree of classification will do!) of input textual data based on the vocabulary present.
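Since any granularity of label is acceptable, one option is to coarsen the labels before resampling so that each class holds more samples. The following is only a minimal sketch: `to_decade` and the toy `years` list are hypothetical names and data, and it assumes each Year value is an integer or a numeric string.

```python
from collections import Counter


def to_decade(year):
    """Map a year such as 1909 (or the string "1909") to its decade, 1900."""
    return int(year) // 10 * 10


# Hypothetical toy labels, not taken from the real dataset.
years = ["1909", "1909", "1910", "1911", "1923"]
decades = [to_decade(y) for y in years]

# Fewer, larger classes: three decades instead of four distinct years.
print(sorted(Counter(decades).items()))  # [(1900, 2), (1910, 2), (1920, 1)]
```

With a pandas Series the same mapping could be applied as `df['Year'].map(to_decade)`.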

Influence answered 20/3, 2018 at 23:48 Comment(4)
The error message is pretty self-explanatory, isn't it? I guess you need more samples (rows) in your X_train - Segovia
Please add the complete stack trace of error. - Vigor
Also please tell us your class imbalance. How many classes and how many samples in each class? - Meditate
Thanks for your responses everyone, I've done my best to address your questions in my edit to the original post. Please let me know if there's anything I could correct at all! - Influence
C
26

Since there are approximately 200 classes and 1800 samples in the training set, you have on average 9 samples per class. The error arises because (a) the data are probably not perfectly balanced, so some classes have fewer than 6 samples, and (b) SMOTE's default k_neighbors=5 makes the nearest-neighbour search ask for 6 points within a class (each sample plus its 5 neighbours), which is the n_neighbors = 6 in the message. A few solutions for your problem:

  1. Calculate the minimum number of samples (n_samples) among the 199 classes and set the k_neighbors parameter of the SMOTE class to at most n_samples - 1.

  2. Exclude the classes with n_samples <= k_neighbors from oversampling using the ratio parameter of the SMOTE class (renamed sampling_strategy in imbalanced-learn 0.4).

  3. Use the RandomOverSampler class, which does not have a similar restriction.

  4. Combine solutions 2 and 3: create a pipeline that uses SMOTE and RandomOverSampler, so that SMOTE is applied only to the classes that satisfy the condition n_neighbors <= n_samples and random oversampling is used when the condition is not satisfied.
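Solutions 1 and 2 could be sketched roughly as follows. This is an illustration rather than a tested recipe for the data in the question: the helper names and the toy labels are hypothetical, and it assumes imbalanced-learn 0.4 or later, where the per-class targets are passed as `sampling_strategy` instead of `ratio`.

```python
from collections import Counter


def safe_k_neighbors(y, default_k=5):
    """Largest k_neighbors (capped at default_k) that every class supports.

    SMOTE searches for k_neighbors + 1 points inside a class, so a class
    with n samples supports at most k_neighbors = n - 1.
    """
    smallest = min(Counter(y).values())
    return min(default_k, smallest - 1)


def smote_targets(y, k_neighbors):
    """Target sizes for the classes big enough for SMOTE; others are left out."""
    counts = Counter(y)
    largest = max(counts.values())
    return {label: largest for label, n in counts.items() if n > k_neighbors}


def smote_eligible_only(X, y, k_neighbors=5, random_state=0):
    """Solution 2: oversample only the classes listed by smote_targets."""
    # Imported lazily; requires imbalanced-learn to be installed.
    from imblearn.over_sampling import SMOTE

    smote = SMOTE(k_neighbors=k_neighbors,
                  sampling_strategy=smote_targets(y, k_neighbors),
                  random_state=random_state)
    return smote.fit_resample(X, y)


# Hypothetical toy labels: 7 samples of 1909, 3 of 1910, 1 of 1911.
y_toy = [1909] * 7 + [1910] * 3 + [1911]
print(safe_k_neighbors(y_toy))       # 0: a one-sample class rules SMOTE out
print(safe_k_neighbors(y_toy[:-1]))  # 2 once the singleton is gone
print(smote_targets(y_toy, 2))       # {1909: 7, 1910: 7}
```

A result of 0 from `safe_k_neighbors` means solution 1 cannot work on its own, which is when solutions 2 to 4 become relevant.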

Claudy answered 22/3, 2018 at 0:35 Comment(3)
Thanks for the answer, what about reducing the number of classes to e.g. decades (for prediction alone) instead of individual years? I'll crack on with your suggestions in the meantime! - Influence
I was unable to investigate (1) and (2) as some classes only possess one sample. However, I was able to successfully pipeline RandomOverSampler (and/or a FakeSampler Class), followed by SMOTE and the Classifier as shown: make_pipeline(sampler, SMOTE(), clf). I'll proceed with this and see what I can do with it! Thanks for your time! - Influence
@Dbercules: hi, could you please explain how you built the pipeline? I tried `sm = SMOTE(random_state=42); rm = RandomOverSampler(random_state=42); my_pipe = make_pipeline(sm, rm); X_res, Y_res = my_pipe.fit_resample(X, y)` but got the same error as in the question title. - Tight
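The order of the samplers is the likely cause of the error in the last comment: with SMOTE first, it still sees the undersized classes. A minimal sketch of the order described above (random oversampling first, then SMOTE) is given below. The helper names and toy labels are hypothetical, and it assumes imbalanced-learn 0.4 or later for `sampling_strategy`.

```python
from collections import Counter


def top_up_targets(y, k_neighbors=5):
    """Target size k_neighbors + 1 for every class too small for SMOTE."""
    return {label: k_neighbors + 1
            for label, n in Counter(y).items() if n <= k_neighbors}


def build_resampler(y, k_neighbors=5, random_state=42):
    """RandomOverSampler tops up tiny classes, then SMOTE balances the rest."""
    # Imported lazily; requires imbalanced-learn to be installed.
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.pipeline import make_pipeline

    ros = RandomOverSampler(sampling_strategy=top_up_targets(y, k_neighbors),
                            random_state=random_state)
    smote = SMOTE(k_neighbors=k_neighbors, random_state=random_state)
    return make_pipeline(ros, smote)


# Hypothetical toy labels: 7 samples of 1909, 3 of 1910, 1 of 1911.
y_toy = [1909] * 7 + [1910] * 3 + [1911]
print(top_up_targets(y_toy))  # {1910: 6, 1911: 6}
```

Usage would then look like `X_res, y_res = build_resampler(y).fit_resample(X, y)`. Note that the topped-up classes consist largely of duplicated rows, so the synthetic points SMOTE generates for them add little new variation.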
P
7

Try the code below for SMOTE (the kind argument exists only in older imbalanced-learn releases; omit it on newer ones):

oversampler=SMOTE(kind='regular',k_neighbors=2)

This worked for me.

Pelaga answered 28/6, 2019 at 14:29 Comment(1)
I got this error: TypeError: __init__() got an unexpected keyword argument 'kind' - Tether
A
3

WHY IT OCCURS:

In my case it occurred because some of the values/categories had as few as 1 sample. Since SMOTE is based on the k-nearest-neighbours concept, it cannot be applied to a category with a single sample.

HOW I SOLVED IT:

Since those single-sample values/categories were effectively outliers, I removed them from the dataset and then applied SMOTE, and it worked.

Also try decreasing the k_neighbors parameter to make it work:

xr, yr = SMOTE(k_neighbors=3).fit_resample(x, y)
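
The removal step described above might look like the following sketch. `keep_indices` and the toy labels are hypothetical; with k_neighbors=3, every remaining class needs at least 4 samples.

```python
from collections import Counter


def keep_indices(y, min_samples):
    """Positions of the rows whose class has at least min_samples members."""
    counts = Counter(y)
    return [i for i, label in enumerate(y) if counts[label] >= min_samples]


# Hypothetical toy labels: only 1909 has enough samples for k_neighbors=3.
y_toy = [1909, 1909, 1909, 1909, 1910, 1911, 1911]
idx = keep_indices(y_toy, min_samples=4)
print(idx)  # [0, 1, 2, 3]

# The same positions can then select rows of the feature matrix,
# e.g. x[idx] for a NumPy array or SciPy CSR matrix.
y_kept = [y_toy[i] for i in idx]
```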
Alburga answered 10/4, 2021 at 16:49 Comment(0)
S
0

I think it's possible to use code like this (in imbalanced-learn 0.4 and later the ratio argument is called sampling_strategy):

sampler = SMOTE(ratio={1: 1927, 0: 300},random_state=0)

Succubus answered 3/9, 2019 at 9:27 Comment(0)
S
0

I was able to solve this issue following number 1 of this answer.

from collections import Counter
from imblearn.over_sampling import SMOTE

print(Counter(y)) # get the number of samples in each class

# drop the classes with only 1 sample: that is too few for k_neighbors, which I set to 2 (every remaining class needs at least k_neighbors + 1 = 3 samples)

X_res, y_res = SMOTE(k_neighbors=2).fit_resample(X, y)
Spirogyra answered 13/12, 2022 at 1:15 Comment(0)
