How to get the logits of the model with a text classification pipeline from HuggingFace?

I need to use pipeline to handle both tokenization and inference with the distilbert-base-uncased-finetuned-sst-2-english model over my dataset.

My data is a list of sentences; for reproduction purposes we can assume it is:

texts = ["this is the first sentence", "of my data.", "In fact, thats not true,", "but we are going to assume it", "is"]

Before using pipeline, I was getting the logits from the model outputs like this:

import torch

with torch.no_grad():
    logits = model(**tokenized_test).logits
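
where tokenized_test is the tokenizer output for my dataset, e.g.:

# batch-encode the sentences into tensors the model accepts
tokenized_test = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")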

Now I have to use pipeline, so this is the way I'm getting the model's output:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
model = AutoModelForSequenceClassification.from_pretrained(selected_model, num_labels=2)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
print(classifier(texts))

which gives me:

[{'label': 'POSITIVE', 'score': 0.9746173024177551}, {'label': 'NEGATIVE', 'score': 0.5020197629928589}, {'label': 'NEGATIVE', 'score': 0.9995120763778687}, {'label': 'NEGATIVE', 'score': 0.9802979826927185}, {'label': 'POSITIVE', 'score': 0.9274746775627136}]

And I can't get the 'logits' field anymore.

Is there a way to get the logits instead of the label and score? Would a custom pipeline be the best and/or easiest way to do it?

Gosser answered 8/6, 2023 at 17:26 Comment(1)
I'm going to answer, but I hope you'll be patient if I need more clarification. This question is good and the context is clear, since I know the other question.Loutish

When you use the default pipeline, the postprocess function usually applies a softmax to the logits and returns only a label and score, e.g.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")

[out]:

[{'label': 'NEGATIVE', 'score': 0.9379090666770935},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]
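
For reference, this is roughly what that postprocess step does under the hood: softmax the logits and pick the argmax label. A minimal sketch, using the model and tokenizer defined above:

import torch

inputs = tokenizer('hello this is a test', return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits        # raw, unnormalized scores
probs = torch.softmax(logits, dim=-1)      # this is where the pipeline's 'score' comes from
label_id = probs.argmax(dim=-1).item()
print(model.config.id2label[label_id], probs[0, label_id].item())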

So what you want is to override the postprocess logic by subclassing the pipeline.

To check which pipeline class the classifier is an instance of, do this:

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
type(classifier)

[out]:

transformers.pipelines.text_classification.TextClassificationPipeline

Now that you know the parent class of the task pipeline you want to use, you can do this and still enjoy the perks of the prebuilt batching from TextClassificationPipeline:

from transformers import TextClassificationPipeline

class MarioThePlumber(TextClassificationPipeline):
    def postprocess(self, model_outputs):
        # skip the softmax and return the raw logits instead
        logits = model_outputs["logits"]
        return logits

pipe = MarioThePlumber(model=model, tokenizer=tokenizer)

pipe(text, batch_size=2, truncation="only_first")

[out]:

[tensor([[ 1.5094, -1.2056]]),
 tensor([[-3.4114,  3.5229]]),
 tensor([[ 1.8835, -1.6886]]),
 tensor([[ 3.0780, -2.5745]]),
 tensor([[ 2.5383, -2.1984]])]
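
If you need one tensor rather than a list of per-sentence tensors, you can concatenate the results; applying a softmax then recovers the scores the default pipeline reports. A small usage sketch, assuming torch is imported:

logits = torch.cat(pipe(text, batch_size=2, truncation="only_first"))  # shape (5, 2)
probs = torch.softmax(logits, dim=-1)  # same numbers as the default pipeline's 'score'
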
Loutish answered 8/6, 2023 at 20:13 Comment(1)
That was it! Thanks! Nice class name :)Gosser
