Hugging Face Transformers classification using num_labels=1 vs 2

question 1)

The answer to this question suggested that for a binary classification problem I could set num_labels to 1 (positive or not) or 2 (positive and negative). Is there any guideline on which setting is better? It seems that if we use 1, the probability is computed with a sigmoid function, and if we use 2, the probabilities are computed with a softmax function.

question 2)

In both cases, are my y labels going to be the same? Will each data point have 0 or 1 rather than a one-hot encoding? For example, if I have 2 data points, y would be 0, 1 and not [1,0], [0,1].

I have a very unbalanced classification problem where class 1 is present only 2% of the time. In my training data I am oversampling the minority class.
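
A minimal sketch of naive minority-class oversampling in pandas (the column names and 0/1 coding are assumptions based on the rest of the question):

import pandas as pd

# Hypothetical illustration: upsample the rare class so both classes
# appear equally often; train_df stands in for the question's dataframe.
train_df = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e'],
                         'label': [0, 0, 0, 0, 1]})
minority = train_df[train_df['label'] == 1]
majority = train_df[train_df['label'] == 0]
balanced = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
).sample(frac=1, random_state=0)  # shuffle after concatenating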

question 3)

My data is in a pandas dataframe, and I am converting it to a Dataset and creating the y variable using the line below. How should I cast my y column (label) if I am planning to use num_labels=1?

`train_dataset=Dataset.from_pandas(train_df).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))`
Placer answered 6/4, 2022 at 13:53

Well, it is probably kind of late, but I want to point out one thing: according to the Hugging Face code, if you set num_labels = 1, it will actually trigger regression modeling, and the loss function will be set to MSELoss(). You can find the code here.
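
A minimal sketch that makes this behavior visible (it assumes the distilbert-base-uncased checkpoint can be downloaded):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1
)

inputs = tokenizer(["a great movie"], return_tensors="pt")
# With num_labels=1 and no explicit problem_type, the forward pass infers
# "regression", so labels must be floats and the loss is MSELoss.
outputs = model(**inputs, labels=torch.tensor([1.0]))
print(model.config.problem_type)  # "regression"
print(outputs.loss)               # mean squared error between logit and label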

Also, in their own tutorial, for a binary classification problem (IMDB, positive vs. negative), they set num_labels = 2.

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Here is the link.

Cavanagh answered 5/6, 2022 at 19:26

  1. As answered here, the sigmoid activation function is just a special case of the 2-class softmax activation function: if the weights feeding one of the two outputs are set to zero, that logit is always zero, and the softmax over the pair reduces to a sigmoid of the remaining logit. So, for performance reasons (faster updates and fewer parameters), you should use sigmoid; a quick numerical check follows this list.

  2. When your output dimension is one, there is nothing to one-hot encode: one class simply gets 0 and the other gets 1. So for 2 data points, your y would be 0, 1.

  3. ClassLabel is used to give names to the integer labels that represent classes. To use it, your y column should consist of zeros and ones. You can see in the PyTorch example below that a two-class ClassLabel column is represented as a single column of 0s and 1s.
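
To make point 1 concrete, here is a quick numerical check that the sigmoid is the 2-class softmax with one logit pinned to zero:

import torch

# sigmoid(z) equals the second component of softmax([0, z]):
# fixing one logit at zero reproduces the binary sigmoid.
z = torch.randn(5)
two_logits = torch.stack([torch.zeros_like(z), z], dim=1)
print(torch.allclose(torch.sigmoid(z),
                     torch.softmax(two_logits, dim=1)[:, 1]))  # True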

PyTorch example for point 3:

from datasets import Dataset, ClassLabel
import pandas as pd
import torch

train_df = pd.DataFrame({'column': [1, 2, 3, 4, 5], 'label': [0, 1, 0, 1, 0]})
# Cast the integer label column to a named two-class ClassLabel.
train_dataset = Dataset.from_pandas(train_df).cast_column(
    "label", ClassLabel(num_classes=2, names=['neg', 'pos'])
)
train_dataset.set_format(type='torch', columns=['column', 'label'])
dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=5)
print(next(iter(dataloader)))

output:

{'column': tensor([1, 2, 3, 4, 5]), 'label': tensor([0, 1, 0, 1, 0])}
  • If your y column consists of neg and pos strings, pandas can map them to integers as below:

label_mapping = {'neg': 0, 'pos': 1}
train_df['label'] = train_df['label'].apply(lambda x: label_mapping[x])
Macon answered 12/4, 2022 at 15:08 Comment(3)
Thanks, but I don't think your answer is specific to the Hugging Face transformers library. Correct me if you have tested your approach with the Hugging Face transformers library. - Placer
I appreciate your help, but I am still not clear: how would you change the line train_dataset=Dataset.from_pandas(train_df).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'])) to indicate num_classes=1? - Placer
You don't need to change num_classes, because you actually have 2 classes (neg and pos). But the point is, the datasets library encodes any 2-class ClassLabel column into a one-dimensional vector of 0s and 1s; in general, you can one-hot encode any column with n different labels using (n-1)-dimensional vectors. - Macon
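
Building on that last comment: if you did want the num_labels=1 path (which, per the accepted answer, is treated as regression with MSELoss), a minimal sketch of the cast, using the datasets Value feature type instead of ClassLabel, would be:

from datasets import Dataset, Value
import pandas as pd

# Hypothetical sketch: the num_labels=1 path expects float regression
# targets, so cast the label column to float32 instead of a ClassLabel.
train_df = pd.DataFrame({'column': [1, 2, 3, 4], 'label': [0, 1, 0, 1]})
train_dataset = Dataset.from_pandas(train_df).cast_column("label", Value("float32"))
print(train_dataset.features["label"])  # Value(dtype='float32', ...)
print(train_dataset["label"])           # [0.0, 1.0, 0.0, 1.0]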
