PyTorch - how to undersample using WeightedRandomSampler

Asked 20/2, 2020 at 12:43 Answered 2/6, 2021 at 9:42

neural-network pytorch imbalanced-data conv-neural-network

I have an unbalanced dataset and would like to undersample the overrepresented class. How do I go about it? I would like to use WeightedRandomSampler, but I am also open to other suggestions.

So far I am assuming that my code will have to be structured roughly like the following, but I don't know exactly how to do it.

trainset = datasets.ImageFolder(path_train, transform=transform)
...
sampler = data.WeightedRandomSampler(weights=..., num_samples=..., replacement=...)
...
trainloader = data.DataLoader(trainset, batch_size=batchsize, sampler=sampler)

I hope someone can help. Thanks a lot

Alfieri answered 20/2, 2020 at 12:43 Comment(0)

From my understanding, the 'weights' argument of PyTorch's WeightedRandomSampler is similar to the 'p' argument of numpy.random.choice: it controls how likely each sample is to be randomly selected. It is not exactly the same, though, because the docs state that the weights don't have to sum to 1. The larger a sample's weight, the more likely that sample is to be drawn.
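To make the comparison concrete, here is a small sketch. The toy weights, the number of draws, and the seed are my own choices for illustration, not something from the question. It suggests that only the relative size of the weights matters: the weights sum to 6 rather than 1, and a sample with weight 0 is never drawn.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Toy weights for 4 samples; they sum to 6, not 1.
weights = [0.0, 1.0, 2.0, 3.0]

# The seeded generator only makes the draw reproducible.
gen = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(weights, num_samples=6000, replacement=True, generator=gen)

drawn = torch.tensor(list(sampler))
counts = torch.bincount(drawn, minlength=len(weights))
fractions = counts.double() / counts.sum()

# The selection probabilities should be close to weights / sum(weights).
expected = torch.tensor(weights, dtype=torch.double) / sum(weights)
print(fractions)
print(expected)
```

The empirical fractions should land close to the normalized weights, which is the same role 'p' plays in numpy.random.choice after normalization.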

With replacement=True, a training example can be drawn more than once, so copies of the same example can end up in the data used to train your model; that is oversampling. Conversely, examples whose weights are low COMPARED TO THE OTHER TRAINING SAMPLE WEIGHTS have a lower chance of being selected; that is undersampling.
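A quick way to see the effect of the replacement flag is sketched below, assuming five samples with equal weights (a toy setup of mine, not the asker's data). With replacement, asking for more draws than there are samples forces repeats; without replacement, every index can appear at most once.

```python
import torch
from torch.utils.data import WeightedRandomSampler

weights = torch.ones(5)  # 5 samples, equal weights

# replacement=True: 10 draws from 5 samples must contain repeated indices.
with_repl = list(WeightedRandomSampler(weights, num_samples=10, replacement=True))

# replacement=False: each index is drawn at most once, so num_samples
# cannot exceed the number of samples with non-zero weight.
without_repl = list(WeightedRandomSampler(weights, num_samples=5, replacement=False))

print(with_repl)
print(sorted(without_repl))
```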

I don't know exactly how the num_samples argument interacts with the train loader, but I can warn you NOT to put your batch size there. I tried that today and it gave horrible results. My co-worker used the number of classes * 100 and his results were much better. I also tried setting num_samples to the size of my whole training set; that gave better results but took forever to train. Either way, play around with it and see what works best for you. My guess is that the safe bet is to set num_samples to the number of training examples.

Here's an approach I saw somebody else use, and I use it as well for binary classification. It seems to work just fine. You take the inverse of the number of training examples in each class and assign that value as the weight of every training example in that class.

A quick example using your trainset object:

import numpy as np
from torch.utils import data
from torch.utils.data import WeightedRandomSampler

labels = np.array(trainset.samples)[:, 1]  # column 1 of the (path, class_index) pairs holds the labels
labels = labels.astype(int)  # the array has a string dtype, so convert the labels to int
majority_weight = 1 / num_of_majority_class_training_examples
minority_weight = 1 / num_of_minority_class_training_examples
# Indexed by class label. This assumes the minority class is label 1; if not, swap the two entries.
sample_weights = np.array([majority_weight, minority_weight])
weights = sample_weights[labels]  # look up each training example's weight by its label (0 or 1)
# num_samples=len(weights) draws as many samples per epoch as there are training examples.
sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)
trainloader = data.DataLoader(trainset, batch_size=batchsize, sampler=sampler)
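The inverse-frequency weighting can also be tried end to end without an image folder. The sketch below is only an illustration under my own assumptions: a synthetic TensorDataset with a 900/100 class split stands in for trainset, np.bincount replaces the hand-written class counts, and the seed is there just for reproducibility.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Synthetic stand-in for the image dataset: 900 examples of class 0, 100 of class 1.
labels = np.array([0] * 900 + [1] * 100)
features = torch.randn(len(labels), 4)
dataset = TensorDataset(features, torch.from_numpy(labels))

class_counts = np.bincount(labels)   # number of examples per class
class_weights = 1.0 / class_counts   # inverse class frequency
weights = class_weights[labels]      # one weight per training example

gen = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(
    weights, num_samples=len(weights), replacement=True, generator=gen
)
loader = DataLoader(dataset, batch_size=50, sampler=sampler)

# Collect the labels seen during one pass over the loader.
seen = torch.cat([y for _, y in loader])
minority_fraction = (seen == 1).double().mean().item()
print(f"minority fraction in one epoch: {minority_fraction:.2f}")
```

The minority fraction should come out near 0.5 even though the class makes up only 10% of the data. Note that with replacement=True this balances the classes mostly by repeating minority examples; the majority class is "undersampled" only in the sense that not all of its examples are seen in a given epoch.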

Since the PyTorch docs say that the weights don't have to sum to 1, I think you can also just use the ratio between the imbalanced classes. For example, with 100 training examples of the majority class and 50 of the minority class, the ratio is 2:1. To counterbalance this, I think you can use a weight of 1.0 for each majority-class example and 2.0 for each minority-class example, because you want each minority example to be twice as likely to be selected, which balances the classes during random selection.
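The arithmetic behind the 2:1 example can be checked with a few lines of NumPy. This is just a worked check using the 100/50 split from the paragraph above; the helper class_probabilities is a name I made up for illustration.

```python
import numpy as np

labels = np.array([0] * 100 + [1] * 50)  # 100 majority, 50 minority examples

inverse_count = np.array([1 / 100, 1 / 50])[labels]  # inverse class counts
ratio = np.array([1.0, 2.0])[labels]                 # ratio-based weights

def class_probabilities(weights, labels):
    """Probability that a single weighted draw lands in each class."""
    p = weights / weights.sum()
    return np.array([p[labels == c].sum() for c in np.unique(labels)])

print(class_probabilities(inverse_count, labels))
print(class_probabilities(ratio, labels))
```

Both weightings give each class a probability of 0.5 per draw, because the ratio weights are simply the inverse-count weights multiplied by 100.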

I hope this helped a little bit. Sorry for the sloppy writing, I was in a huge rush and saw that nobody answered. I struggled through this myself without being able to find any help for it either. If it doesn't make sense just say so and I'll re-edit it and make it more clear when I get free time.

Phobia answered 9/4, 2020 at 16:47 Comment(1)

num_samples is the total number of samples drawn in one full pass over the sampler, i.e. in one epoch. So normally you want it to be equal to len(dataset). – Dipietro 6/11, 2020 at 18:58
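This is easy to check by counting batches. The sketch below assumes a toy dataset of 1000 items and a batch size of 32 (both numbers are mine) and compares num_samples=batch_size with num_samples=len(dataset).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

dataset = TensorDataset(torch.arange(1000))
weights = torch.ones(len(dataset))
batch_size = 32

batches_per_epoch = {}
for num_samples in (batch_size, len(dataset)):
    sampler = WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    # len(loader) is the number of batches produced in one epoch.
    batches_per_epoch[num_samples] = len(loader)

print(batches_per_epoch)
```

With these numbers, num_samples=32 yields a single batch per epoch, while num_samples=1000 yields 32 batches. That may be why putting the batch size there trains so poorly: each epoch only ever sees one batch.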

Based on torchdata (disclaimer: I'm the author) one can create a custom undersampler.

First, an _Equalizer base class, which:

creates multiple RandomSubsetSamplers (one for each class)
behaves as an oversampler or undersampler depending on the function passed in (the builtin max or min, given by name)

Code:

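import builtins

import torch
from torch.utils.data import Sampler

# RandomSubsetSampler is assumed to come from torchdata; its import is not shown in the original.


class _Equalizer(Sampler):
    def __init__(self, labels: torch.Tensor, function):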
        if len(labels.shape) > 1:
            raise ValueError(
                "labels can only have a single dimension (N, ), got shape: {}".format(
                    labels.shape
                )
            )
        tensors = [
            torch.nonzero(labels == i, as_tuple=False).flatten()
            for i in torch.unique(labels)
        ]
        self.samples_per_label = getattr(builtins, function)(map(len, tensors))
        self.samplers = [
            iter(
                RandomSubsetSampler(
                    tensor,
                    replacement=len(tensor) < self.samples_per_label,
                    num_samples=self.samples_per_label
                    if len(tensor) < self.samples_per_label
                    else None,
                )
            )
            for tensor in tensors
        ]

    @property
    def num_samples(self):
        return self.samples_per_label * len(self.samplers)

    def __iter__(self):
        for _ in range(self.samples_per_label):
            for index in torch.randperm(len(self.samplers)).tolist():
                yield next(self.samplers[index])

    def __len__(self):
        return self.num_samples
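The two key steps in __init__ can be looked at in isolation using plain PyTorch. This is only a sketch with toy labels of my own; it does not use torchdata. It groups the indices by class and then picks the per-class sample count via the builtin named by function.

```python
import builtins

import torch

labels = torch.tensor([0, 0, 0, 0, 1, 1, 2, 2, 2])

# One index tensor per class, in the order returned by torch.unique.
groups = [
    torch.nonzero(labels == c, as_tuple=False).flatten()
    for c in torch.unique(labels)
]
sizes = [len(g) for g in groups]

# "min" gives the undersampling target, "max" the oversampling target.
under = getattr(builtins, "min")(sizes)
over = getattr(builtins, "max")(sizes)

print(sizes, under, over)
```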

Now we can create the undersampler (I added an oversampler as well, since it is really short):

class RandomUnderSampler(_Equalizer):
    def __init__(self, labels: torch.Tensor):
        super().__init__(labels, "min")

class RandomOverSampler(_Equalizer):
    def __init__(self, labels):
        super().__init__(labels, "max")

Just pass your labels to __init__ (they have to be a 1D tensor, but can contain two or more classes) and you can oversample or undersample your data.
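If you would rather not depend on torchdata, the same idea can be sketched with plain PyTorch. The class below is a hypothetical stand-in written for illustration, not the torchdata implementation: each epoch it takes a fresh random subset of every class, sized to the smallest class, and shuffles the result (the original interleaves the per-class samplers instead). The dataset and class sizes are made up.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class SimpleUnderSampler(Sampler):
    """Hypothetical sampler: min-class-count random indices per class, per epoch."""

    def __init__(self, labels: torch.Tensor):
        self.groups = [
            torch.nonzero(labels == c, as_tuple=False).flatten()
            for c in torch.unique(labels)
        ]
        self.samples_per_label = min(len(g) for g in self.groups)

    def __iter__(self):
        # Pick a fresh random subset of each class, without replacement.
        chosen = torch.cat(
            [g[torch.randperm(len(g))[: self.samples_per_label]] for g in self.groups]
        )
        # Shuffle so the classes are mixed rather than grouped together.
        yield from chosen[torch.randperm(len(chosen))].tolist()

    def __len__(self):
        return self.samples_per_label * len(self.groups)


labels = torch.tensor([0] * 90 + [1] * 10)
dataset = TensorDataset(torch.randn(100, 3), labels)
sampler = SimpleUnderSampler(labels)
loader = DataLoader(dataset, batch_size=5, sampler=sampler)

seen = torch.cat([y for _, y in loader])
print(torch.bincount(seen))
```

Each epoch then contains exactly 10 examples of each class, with no example repeated within the epoch, and a different majority subset is drawn every time the loader is iterated.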

Hangchow answered 2/6, 2021 at 9:42 Comment(0)
