scikit-learn error: The least populated class in y has only 1 member
I'm trying to split my dataset into a training set and a test set using scikit-learn's train_test_split function, but I'm getting the error below. Here are the class counts, followed by the failing call:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

However, all classes have at least 15 samples. Why am I getting this error?

X is a pandas DataFrame which represents the data points, y is a pandas DataFrame with one column that contains the target variable.

I cannot post the original data because it's proprietary, but the error is fairly reproducible: create a random pandas DataFrame (X) with 1k rows x 500 columns, and a pandas DataFrame (y) with the same number of rows (1k), containing the target variable (a categorical label) for each row. The y DataFrame should have several categorical labels (e.g. 'class1', 'class2', ...), each with at least 15 occurrences.

Navaho answered 3/4, 2017 at 8:0 Comment(2)
You should post a complete, reproducible code snippet with the full stack trace of the error and samples of the data. – Sampan
Sometimes this occurs when there are lots of JPEGs and only a few PNGs, or vice versa. As soon as you remove those PNGs, it goes away. Happened to me. – Spinode
13

The problem was that the stratify parameter expects a 1-D array of labels, but my y was a DataFrame, i.e. a 2-D structure. When stratify receives a 2-D array, scikit-learn joins each row into a single composite label, so rows easily end up in classes with only 1 member. If I pass only the column that actually contains the labels, it works:

xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:, 1], test_size=1/3,
  random_state=85, stratify=y.iloc[:, 1])
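
More generally (a sketch of my own, not from the original answer): if y really is a one-column DataFrame, any of the following yields the 1-D labels that stratify expects:

# Assuming y is a one-column pandas DataFrame; each line yields a 1-D
# Series/array of labels suitable for the stratify argument.
labels = y.iloc[:, 0]      # select the column as a Series
labels = y.squeeze()       # collapse the single column to a Series
labels = y.values.ravel()  # flatten to a 1-D NumPy array

xtrain, xtest, ytrain, ytest = train_test_split(
    X, labels, test_size=1/3, random_state=85, stratify=labels)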
Navaho answered 3/4, 2017 at 9:36 Comment(0)
11

The main point: with stratified CV, you get this warning when the number of splits cannot produce CV folds that all contain the same ratio of every class. E.g. with 5-fold CV and only 2 samples of one class, 2 of the folds get 1 sample of that class and the other 3 get none, so the class ratio is not the same in all folds. The real problem arises only when some fold gets 0 samples of a class, so as long as every class has at least as many samples as the number of CV splits (i.e. 5 in this case), this warning won't appear.

See https://mcmap.net/q/602724/-valueerror-n_splits-10-cannot-be-greater-than-the-number-of-members-in-each-class.
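
A minimal sketch (synthetic data, my own illustration) that triggers the warning:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# class "b" has only 2 samples, fewer than n_splits=5
y = np.array(["a"] * 10 + ["b"] * 2)
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    # UserWarning: The least populated class in y has only 2 members,
    # which is less than n_splits=5.
    print((y[test_idx] == "b").sum())  # 1 for two folds, 0 for the other three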

Spate answered 25/6, 2020 at 20:2 Comment(2)
If you just ignore the warning, is it a big issue? Will it just split as if there were no stratify=y? – Detrimental
It's not a big issue per se; you just need to be aware that one of the split groups won't have the same ratio of sample types as the rest. The larger the number of samples in each split, and the fewer samples missing from the last split, the less it affects you. – Spate
2

Do you like "functional" programming? Like confusing your co-workers, and writing everything in one line of code? Are you the type of person who loves nested ternary operators, instead of 2 'if' statements? Are you an Elixir programmer trapped in a Python programmer's body?

If so, the following solution may work for you. It allows you to discover how many members the least-populated class has, in real-time, then adjust your cross-validation value on the fly:

""" Let's say our dataframe is like this, for example:
 
    dogs         weight     size
    ----         ----       ----
    Poodle       14         small
    Maltese      13         small
    Shepherd     45         big
    Retriever    41         big
    Burmese      43         big

The 'least populated class' would be 'small', as it only has 2 members.
If we tried doing more than 2-fold cross validation on this, the results
would be skewed.
"""

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The example dataframe from the docstring above
df = pd.DataFrame({
    "dogs": ["Poodle", "Maltese", "Shepherd", "Retriever", "Burmese"],
    "weight": [14, 13, 45, 41, 43],
    "size": ["small", "small", "big", "big", "big"],
})

X = df[["weight"]]  # features must be 2-D for scikit-learn
y = df["size"]

# Random forest classifier, to classify dogs into big or small
model = RandomForestClassifier()

# Find the number of members in the least-populated class,
# THIS IS THE LINE WHERE THE MAGIC HAPPENS :)
leastPopulated = [x for d in set(list(y)) for x in list(y) if x == d].count(min([x for d in set(list(y)) for x in list(y) if x == d], key=[x for d in set(list(y)) for x in list(y) if x == d].count))

# I want to know the F1 score at each fold of cross-validation.
# This 'fOne' variable will be a list of the F1 scores from each fold
fOne = cross_val_score(model, X, y, cv=leastPopulated, scoring='f1_weighted')

# We print the average F1 score here
print(f"Average F1 score during cross-validation: {np.mean(fOne)}")
Vandalism answered 15/12, 2022 at 5:47 Comment(0)
1

I had the same problem: some of my classes had only one or two items (mine is a multi-class problem). You can remove such classes, or merge them with other classes, before splitting. That's how I solved it.
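
For example (a minimal sketch, assuming X is a DataFrame and y is a pandas Series of labels), dropping the rows whose class has fewer than 2 members before splitting:

from sklearn.model_selection import train_test_split

counts = y.value_counts()
keep = y.isin(counts[counts >= 2].index)  # classes with at least 2 members
X_f, y_f = X[keep], y[keep]

xtrain, xtest, ytrain, ytest = train_test_split(
    X_f, y_f, test_size=1/3, random_state=85, stratify=y_f)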

Alderete answered 23/11, 2021 at 13:29 Comment(0)
0

Continuing with user2340939's answer. If you really need your train-test splits to be stratified despite some classes having too few rows, you can try the following method. I generally use the same approach, copying all the rows of such classes into both the train and test datasets.

from sklearn.model_selection import train_test_split

def get_min_required_rows(test_size=0.2):
    return 1 / test_size

def make_stratified_splits(df, y_col="label", test_size=0.2):
    """
        for any class with rows less than min_required_rows corresponding to the input test_size,
        all the rows associated with the specific class will have a copy in both the train and test splits.
        
        example: if test_size is 0.2 (20% otherwise),
        min_required_rows = 5 (which is obtained from 1 / test_size i.e., 1 / 0.2)
        where the resulting splits will have 4 train rows (80%), 1 test row (20%)..
    """
    
    id_col = "id"
    temp_col = "same-class-rows"
    
    class_to_counts = df[y_col].value_counts()
    df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])
    
    min_required_rows = get_min_required_rows(test_size)
    copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
    valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)
    
    X = valid_rows[id_col].tolist()
    y = valid_rows[y_col].tolist()
    
    # notice, this train_test_split is a stratified split
    X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)
    
    X_test = X_test + copy_rows[id_col].tolist()
    X_train = X_train + copy_rows[id_col].tolist()
    
    df.drop([temp_col], axis=1, inplace=True)
    
    test_df = df[df[id_col].isin(X_test)].copy(deep=True)
    train_df = df[df[id_col].isin(X_train)].copy(deep=True)
    
    print (f"number of rows in the original dataset: {len(df)}")
    
    test_prop = round(len(test_df) / len(df) * 100, 2)
    train_prop = round(len(train_df) / len(df) * 100, 2)
    print (f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")
    
    return train_df, test_df
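
Usage might look like this (a sketch; the "id" and "label" column names are assumptions baked into the function above):

train_df, test_df = make_stratified_splits(df, y_col="label", test_size=0.2)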
Blaise answered 11/5, 2021 at 12:19 Comment(1)
"where I'll make a copy of all the rows of such classes to both the train and test datasets" - this is a grave mistake and very poor advice! The test set should contain nothing from your training data, otherwise the whole analysis is invalid! The point here is not to "bypass" the error message by some programming hack, it is to realize what exactly happens and what actions one can and cannot do with their data.Lubricity
0

I had this issue because some of the objects I was splitting were lists and some were arrays. When I converted the arrays to lists, it worked.

Mylor answered 6/9, 2021 at 23:27 Comment(0)
0
import pandas as pd
from sklearn.model_selection import train_test_split

all_keys = df['Key'].unique().tolist()

train_parts = []
test_parts = []

for key in all_keys:
    subset = df.loc[df['Key'] == key]
    if subset.shape[0] < 2:
        # too few rows to split; keep them all in the first set
        train_parts.append(subset)
    else:
        df_t, df_c = train_test_split(subset, test_size=0.2, stratify=subset['Key'])
        train_parts.append(df_t)
        test_parts.append(df_c)

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
t_df = pd.concat(train_parts)
c_df = pd.concat(test_parts)
Sweetbread answered 10/2, 2022 at 15:22 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Orthorhombic
0

When you use stratify=y, combine the rarest categories into a single one. For example, filter out the labels with fewer than 50 occurrences and relabel them all as one category such as "others"; the least-populated-class error then goes away.
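
A minimal sketch of that relabeling (assuming y is a pandas Series; the threshold of 50 and the "others" name come from the answer):

from sklearn.model_selection import train_test_split

counts = y.value_counts()
rare = counts[counts < 50].index
y_merged = y.where(~y.isin(rare), "others")  # rare labels become "others"

xtrain, xtest, ytrain, ytest = train_test_split(
    X, y_merged, test_size=1/3, random_state=85, stratify=y_merged)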

Pauwles answered 20/6, 2022 at 7:44 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Cantaloupe
-2

Try it this way; it worked for me (note that it simply drops the stratify argument):

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42)
Circuit answered 14/2, 2020 at 23:53 Comment(0)
-3

Remove stratify=y while splitting the train and test data:

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
Retroactive answered 3/4, 2020 at 23:42 Comment(2)
Why remove it? Can you give a brief explanation? – Anew
stratify is, as it states, an option to "stratify" your train/test split (i.e., to take into account the distribution of the passed feature). If you remove it, you are doing something totally different than before: you're splitting the data randomly, NOT considering any distribution. Thus, this is a very misleading answer and should not be rated as high as it is. – Noguchi
-3

Remove stratify.

stratify=y

should only be used for classification problems, so that the various output classes (say 'good', 'bad') are distributed proportionally between the train and test data. It is a sampling method from statistics. Avoid stratify in regression problems. The code below should work:

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
Headman answered 16/8, 2021 at 19:4 Comment(1)
That worked in my case. Thank you. – Predicant
